[Model][Rebase] Add GLM-Image Model and Partial Rebase to v0.14.0 (Support AR Offline) by tzhouam · Pull Request #763 · vllm-project/vllm-omni

tzhouam · 2026-01-13T06:31:44Z

Purpose

This PR adds support for the GLM-Image model and partially rebases onto v0.14.0. After the rebase, AR (autoregressive) offline inference is supported.

Installation

# Create and activate a uv virtual environment
uv venv --python 3.12 --seed
source .venv/bin/activate

# Install vLLM at the pinned commit
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 9273a427b5ec1491f7e8420c8fdab8d203843b75
VLLM_USE_PRECOMPILED=1 uv pip install --editable .

# Return to the parent directory
cd ..

# Install vLLM-Omni from the rebase branch at the pinned commit
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
git checkout -b dev/rebase_0.14.0 remotes/origin/dev/rebase-0.14.0
git checkout f269e0e453b5a000209fc22b59ce99f6c6bb0e00
uv pip install -e .

# Install up-to-date transformers and diffusers
pip install git+https://github.com/huggingface/transformers.git
pip install git+https://github.com/huggingface/diffusers.git
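
As a quick sanity check after the steps above, a short script can report which versions ended up in the environment. This is only a sketch: the distribution names in PACKAGES (in particular "vllm-omni") are assumptions and may need adjusting. For reference, the Qwen 3 Omni log in this PR reports vLLM as 0.14.0rc1.dev533+g9273a427b at the pinned commit.

```python
from importlib import metadata

# Distribution names are assumptions; adjust them if your environment differs.
PACKAGES = ["vllm", "vllm-omni", "transformers", "diffusers"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

A missing entry usually means the corresponding install command ran outside the activated virtual environment.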

GLM-Image

Test Plan

Tested GLM-Image with the following commands:

cd examples/offline_inference/text_to_image
python3 text_to_image.py --model zai-org/GLM-Image
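
Once the command has written its PNG, one way to confirm the result without extra dependencies is to read the image dimensions straight from the PNG header. The helper below is a hypothetical sketch, not part of the example script; the run logged in this PR reports an image size of 1024x1024 and saves to GLM-Image_output.png when --output is passed.

```python
import struct
import sys

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> tuple[int, int]:
    """Return (width, height) from the IHDR chunk at the start of PNG data."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # Width and height are big-endian unsigned 32-bit integers at bytes 16..24.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_png.py GLM-Image_output.png
    with open(sys.argv[1], "rb") as f:
        print(png_size(f.read(24)))
```

Only the first 24 bytes are needed, so the check stays cheap even for large images.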

Test Result

Image:

log:

python3 text_to_image.py --model zai-org/GLM-Image --output GLM-Image_output.png
WARNING 01-14 11:37:47 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
INFO 01-14 11:37:48 [omni.py:122] Initializing stages for model: zai-org/GLM-Image
INFO 01-14 11:37:48 [initialization.py:35] No OmniTransferConfig provided
INFO 01-14 11:37:48 [omni_stage.py:108] [OmniStage] stage_config: {'stage_id': 0, 'stage_type': 'diffusion', 'runtime': {'process': True, 'devices': '0', 'max_batch_size': 1}, 'engine_args': {'model': 'zai-org/GLM-Image', 'vae_use_slicing': False, 'vae_use_tiling': False, 'cache_backend': None, 'cache_config': None, 'parallel_config': {'pipeline_parallel_size': 1, 'data_parallel_size': 1, 'tensor_parallel_size': 1, 'sequence_parallel_size': 1, 'ulysses_degree': 1, 'ring_degree': 1, 'cfg_parallel_size': 1}, 'enforce_eager': False, 'model_stage': 'diffusion'}, 'final_output': True, 'final_output_type': 'image'}
INFO 01-14 11:37:48 [omni.py:302] [Orchestrator] Waiting for 1 stages to initialize (timeout: 300s)
[Stage-0] WARNING 01-14 11:37:58 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
[Stage-0] INFO 01-14 11:37:59 [omni_stage.py:435] Starting stage worker with model: zai-org/GLM-Image
[Stage-0] WARNING 01-14 11:38:00 [envs.py:194] Flash Attention library "flash_attn" not found, using pytorch attention implementation
[Stage-0] INFO 01-14 11:38:00 [weight_utils.py:46] Using model weights format ['*']
[Stage-0] INFO 01-14 11:38:00 [diffusion_engine.py:231] Starting server...
[Stage-0] WARNING 01-14 11:38:10 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
[Stage-0] WARNING 01-14 11:38:11 [envs.py:194] Flash Attention library "flash_attn" not found, using pytorch attention implementation
[Stage-0] INFO 01-14 11:38:12 [gpu_worker.py:273] Worker 0 created result MessageQueue
[Stage-0] INFO 01-14 11:38:12 [scheduler.py:229] Chunked prefill is enabled with max_num_batched_tokens=2048.
[Stage-0] INFO 01-14 11:38:12 [vllm.py:632] Asynchronous scheduling is enabled.
[Stage-0] INFO 01-14 11:38:12 [vllm.py:639] Disabling NCCL for DP synchronization when using async scheduling.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Stage-0] INFO 01-14 11:38:12 [gpu_worker.py:77] Worker 0: Initialized device and distributed environment.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Stage-0] INFO 01-14 11:38:13 [weight_utils.py:46] Using model weights format ['*']
[Stage-0] INFO 01-14 11:38:13 [pipeline_glm_image.py:201] Loading GlmImageForConditionalGeneration (AR model)...
Loading weights: 100%|█████████████████████████████████████████████████████| 1011/1011 [00:02<00:00, 371.05it/s, Materializing param=model.vqmodel.quantize.embedding.weight]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
[Stage-0] INFO 01-14 11:38:18 [pipeline_glm_image.py:214] Loading T5EncoderModel (glyph encoder)...
Loading weights: 100%|█████████████████████████████████████████████████████████████████████████████████| 111/111 [00:00<00:00, 421.32it/s, Materializing param=shared.weight]
[Stage-0] INFO 01-14 11:38:18 [pipeline_glm_image.py:227] Loading AutoencoderKL (VAE)...
[Stage-0] INFO 01-14 11:38:19 [pipeline_glm_image.py:234] Loading GlmImageTransformer2DModel (DiT)...
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:00<00:01,  1.47it/s]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:01<00:00,  1.40it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00,  1.55it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00,  1.52it/s]

[Stage-0] INFO 01-14 11:38:22 [diffusers_loader.py:214] Loading weights took 2.01 seconds
[Stage-0] INFO 01-14 11:38:22 [gpu_worker.py:100] Model loading took 33.0291 GiB and 9.767029 seconds
[Stage-0] INFO 01-14 11:38:22 [gpu_worker.py:105] Worker 0: Model loaded successfully.
[Stage-0] WARNING 01-14 11:38:22 [compile.py:27] Regional compilation skipped because the model does not define `_repeated_blocks`.
[Stage-0] INFO 01-14 11:38:22 [gpu_worker.py:126] Worker 0: Model compiled with torch.compile.
[Stage-0] INFO 01-14 11:38:22 [gpu_worker.py:409] Worker 0: Scheduler loop started.
[Stage-0] INFO 01-14 11:38:22 [gpu_worker.py:332] Worker 0 ready to receive requests via shared memory
[Stage-0] INFO 01-14 11:38:22 [scheduler.py:46] SyncScheduler initialized result MessageQueue
[Stage-0] INFO 01-14 11:38:22 [diffusion_engine.py:313] dummy run to warm up the model
[Stage-0] INFO 01-14 11:38:22 [pipeline_glm_image.py:859] Generating prior tokens with AR model...
[Stage-0] INFO 01-14 11:38:46 [pipeline_glm_image.py:868] Encoding prompt...
[Stage-0] INFO 01-14 11:38:46 [pipeline_glm_image.py:924] Starting denoising loop with 1 steps...
[Stage-0] INFO 01-14 11:38:47 [pipeline_glm_image.py:939] Decoding latents with VAE...
[Stage-0] INFO 01-14 11:38:47 [omni_stage.py:664] Max batch size: 1
INFO 01-14 11:38:47 [omni.py:295] [Orchestrator] Stage-0 reported ready
INFO 01-14 11:38:47 [omni.py:321] [Orchestrator] All stages initialized successfully

============================================================
Generation Configuration:
  Model: zai-org/GLM-Image
  Inference steps: 50
  Cache backend: None (no acceleration)
  Parallel configuration: tensor_parallel_size=1, ulysses_degree=1, ring_degree=1, cfg_parallel_size=1
  Image size: 1024x1024
============================================================

[Stage-0] INFO 01-14 11:38:47 [omni_diffusion.py:115] Prepared 1 requests for generation.
[Stage-0] INFO 01-14 11:38:47 [pipeline_glm_image.py:859] Generating prior tokens with AR model...
[Stage-0] INFO 01-14 11:39:11 [pipeline_glm_image.py:868] Encoding prompt...
[Stage-0] INFO 01-14 11:39:11 [pipeline_glm_image.py:924] Starting denoising loop with 50 steps...
[Stage-0] INFO 01-14 11:39:18 [pipeline_glm_image.py:939] Decoding latents with VAE...
[Stage-0] INFO 01-14 11:39:19 [diffusion_engine.py:104] Generation completed successfully.
[Stage-0] INFO 01-14 11:39:19 [diffusion_engine.py:127] Post-processing completed in 0.0000 seconds
INFO 01-14 11:39:19 [log_utils.py:550] {'type': 'request_level_metrics',
INFO 01-14 11:39:19 [log_utils.py:550]  'request_id': '0_e6dc7b01-6b4c-4927-9aef-548cb2cd30d2',
INFO 01-14 11:39:19 [log_utils.py:550]  'e2e_time_ms': 31775.473594665527,
INFO 01-14 11:39:19 [log_utils.py:550]  'e2e_tpt': 0.0,
INFO 01-14 11:39:19 [log_utils.py:550]  'e2e_total_tokens': 0,
INFO 01-14 11:39:19 [log_utils.py:550]  'transfers_total_time_ms': 0.0,
INFO 01-14 11:39:19 [log_utils.py:550]  'transfers_total_bytes': 0,
INFO 01-14 11:39:19 [log_utils.py:550]  'stages': {0: {'stage_gen_time_ms': 31727.26273536682,
INFO 01-14 11:39:19 [log_utils.py:550]                 'num_tokens_out': 0,
INFO 01-14 11:39:19 [log_utils.py:550]                 'num_tokens_in': 0}}}
Processed prompts: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:31<00:00, 31.78s/img, est. speed stage-0 img/s: 0.00, avg e2e_lat: 0.0ms]
INFO 01-14 11:39:19 [omni.py:782] [Summary] {'e2e_requests': 1,
INFO 01-14 11:39:19 [omni.py:782]  'e2e_total_time_ms': 31776.662349700928,
INFO 01-14 11:39:19 [omni.py:782]  'e2e_sum_time_ms': 31775.473594665527,
INFO 01-14 11:39:19 [omni.py:782]  'e2e_total_tokens': 0,
INFO 01-14 11:39:19 [omni.py:782]  'e2e_avg_time_per_request_ms': 31775.473594665527,
INFO 01-14 11:39:19 [omni.py:782]  'e2e_avg_tokens_per_s': 0.0,
INFO 01-14 11:39:19 [omni.py:782]  'wall_time_ms': 31776.662349700928,
INFO 01-14 11:39:19 [omni.py:782]  'final_stage_id': {'0_e6dc7b01-6b4c-4927-9aef-548cb2cd30d2': 0},
INFO 01-14 11:39:19 [omni.py:782]  'stages': [{'stage_id': 0,
INFO 01-14 11:39:19 [omni.py:782]              'requests': 1,
INFO 01-14 11:39:19 [omni.py:782]              'tokens': 0,
INFO 01-14 11:39:19 [omni.py:782]              'total_time_ms': 31775.777578353882,
INFO 01-14 11:39:19 [omni.py:782]              'avg_time_per_request_ms': 31775.777578353882,
INFO 01-14 11:39:19 [omni.py:782]              'avg_tokens_per_s': 0.0}],
INFO 01-14 11:39:19 [omni.py:782]  'transfers': []}
Adding requests:   0%|                                                                                                                                 | 0/1 [00:31<?, ?it/s]
[Stage-0] INFO 01-14 11:39:19 [omni_stage.py:673] Received shutdown signal
[Stage-0] INFO 01-14 11:39:19 [gpu_worker.py:364] Worker 0: Received shutdown message
[Stage-0] INFO 01-14 11:39:19 [gpu_worker.py:386] event loop terminated.
[Stage-0] INFO 01-14 11:39:19 [gpu_worker.py:417] Worker 0: Shutdown complete.
Total generation time: 34.5199 seconds (34519.91 ms)
INFO 01-14 11:39:22 [text_to_image.py:196] Outputs: [OmniRequestOutput(request_id='', finished=True, stage_id=0, final_output_type='image', request_output=[OmniRequestOutput(request_id='0_e6dc7b01-6b4c-4927-9aef-548cb2cd30d2', finished=True, stage_id=None, final_output_type='image', request_output=None, images=[1 PIL Images], prompt='a cup of coffee on the table', latents=None, metrics={})], images=[], prompt=None, latents=None, metrics={})]
Saved generated image to GLM-Image_output.png
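
The timestamps in the log above give a rough picture of where the ~31.8 s request time goes. The sketch below parses the phase markers of the 50-step request and computes the gap between consecutive markers. The log lines are copied from this PR; the regex and the phase_durations helper are illustrative assumptions. Timestamps have one-second resolution, so the numbers are approximate.

```python
import re
from datetime import datetime

# Phase markers copied from the 50-step request in the log above.
LOG = """\
[Stage-0] INFO 01-14 11:38:47 [pipeline_glm_image.py:859] Generating prior tokens with AR model...
[Stage-0] INFO 01-14 11:39:11 [pipeline_glm_image.py:868] Encoding prompt...
[Stage-0] INFO 01-14 11:39:11 [pipeline_glm_image.py:924] Starting denoising loop with 50 steps...
[Stage-0] INFO 01-14 11:39:18 [pipeline_glm_image.py:939] Decoding latents with VAE...
[Stage-0] INFO 01-14 11:39:19 [diffusion_engine.py:104] Generation completed successfully.
"""

# Capture the time of day and the message; the date is ignored in this sketch,
# so runs that cross midnight are not handled.
LINE_RE = re.compile(r"INFO \d\d-\d\d (\d\d:\d\d:\d\d) \[[^\]]+\] (.+)")


def phase_durations(log_text):
    """Return (message, seconds until the next marker) for consecutive log lines."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            timestamp = datetime.strptime(match.group(1), "%H:%M:%S")
            events.append((timestamp, match.group(2)))
    return [
        (message, (next_ts - ts).total_seconds())
        for (ts, message), (next_ts, _) in zip(events, events[1:])
    ]


if __name__ == "__main__":
    for message, seconds in phase_durations(LOG):
        print(f"{seconds:5.1f}s  {message}")
```

On these lines the gaps come out to about 24 s for AR prior-token generation, under 1 s for prompt encoding, 7 s for the 50 denoising steps, and 1 s for VAE decoding, which suggests the AR stage dominates end-to-end latency in this run.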

Qwen 3 Omni

Test Plan

Tested the Qwen 3 Omni Thinker stage with CUDA Graph enabled, using the "use_image" query.

Test Result

text:

Prompt:
<|im_start|>system
You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>What is the content of this image?<|im_end|>
<|im_start|>assistant

vllm_text_output:
Based on the provided image and its detailed crops, here is a description of its content:

This is a low-angle photograph capturing a beautiful spring scene in Japan. The main subject is the **Tokyo Skytree**, a famous telecommunications and observation tower, which is partially visible through the branches of a cherry blossom tree.

The composition uses the foreground elements to frame the tower:
*   **Cherry Blossoms (Sakura):** Pink cherry blossoms are in full bloom, with their delicate petals and dark branches creating a natural frame around the central structure. The focus appears to be on these flowers.
*   **Sky:** A vibrant, clear blue sky serves as the background, providing a strong contrast to the pink flowers and the white tower.

Overall, the image evokes a sense of serenity and seasonal beauty, blending the iconic modern architecture of Tokyo with the traditional and ephemeral beauty of cherry blossoms.

log:

WARNING 01-14 09:16:27 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
==================== 
 vllm version: 0.14.0rc1.dev533+g9273a427b 
 ====================
INFO 01-14 09:16:28 [omni.py:122] Initializing stages for model: /home/dyvm6xra/dyvm6xrauser49/project/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_section', 'interleaved'}
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_interleaved', 'mrope_section', 'interleaved'}
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_section', 'interleaved'}
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_interleaved', 'mrope_section', 'interleaved'}
INFO 01-14 09:16:29 [initialization.py:232] Loaded OmniTransferConfig with 0 connector configurations
INFO 01-14 09:16:29 [omni_stage.py:108] [OmniStage] stage_config: {'stage_id': 0, 'stage_type': 'llm', 'runtime': {'devices': '0,1', 'max_batch_size': 1}, 'engine_args': {'model_stage': 'thinker', 'model_arch': 'Qwen3OmniMoeForConditionalGeneration', 'worker_cls': 'vllm_omni.worker.gpu_ar_worker.GPUARWorker', 'scheduler_cls': 'vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler', 'gpu_memory_utilization': 0.6, 'enforce_eager': False, 'trust_remote_code': True, 'engine_output_type': 'latent', 'distributed_executor_backend': 'mp', 'enable_prefix_caching': False, 'max_num_batched_tokens': 32768, 'hf_config_name': 'thinker_config', 'tensor_parallel_size': 2}, 'final_output': True, 'final_output_type': 'text', 'is_comprehension': True, 'default_sampling_params': {'temperature': 0.4, 'top_p': 0.9, 'top_k': 1, 'max_tokens': 2048, 'seed': 42, 'detokenize': True, 'repetition_penalty': 1.05}}
INFO 01-14 09:16:29 [omni.py:302] [Orchestrator] Waiting for 1 stages to initialize (timeout: 300s)
[Stage-0] WARNING 01-14 09:16:47 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
[Stage-0] INFO 01-14 09:16:48 [omni_stage.py:435] Starting stage worker with model: /home/dyvm6xra/dyvm6xrauser49/project/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'interleaved', 'mrope_section'}
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'interleaved', 'mrope_section', 'mrope_interleaved'}
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'interleaved', 'mrope_section'}
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'interleaved', 'mrope_section', 'mrope_interleaved'}
[Stage-0] INFO 01-14 09:16:48 [initialization.py:232] Loaded OmniTransferConfig with 0 connector configurations
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'interleaved', 'mrope_section'}
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'interleaved', 'mrope_section', 'mrope_interleaved'}
[Stage-0] INFO 01-14 09:16:48 [model.py:530] Resolved architecture: Qwen3OmniMoeForConditionalGeneration
[Stage-0] INFO 01-14 09:16:48 [model.py:1545] Using max model len 65536
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'interleaved', 'mrope_section'}
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'interleaved', 'mrope_section', 'mrope_interleaved'}
[Stage-0] INFO 01-14 09:17:07 [model.py:209] Resolved architecture: Qwen3OmniMoeForConditionalGeneration
[Stage-0] INFO 01-14 09:17:07 [model.py:1545] Using max model len 65536
[Stage-0] INFO 01-14 09:17:07 [scheduler.py:229] Chunked prefill is enabled with max_num_batched_tokens=32768.
[Stage-0] INFO 01-14 09:17:07 [vllm.py:632] Asynchronous scheduling is enabled.
[Stage-0] INFO 01-14 09:17:07 [vllm.py:639] Disabling NCCL for DP synchronization when using async scheduling.
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'interleaved', 'mrope_section'}
[Stage-0] WARNING 01-14 09:17:31 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
(EngineCore_DP0 pid=2969360) [Stage-0] INFO 01-14 09:17:33 [core.py:97] Initializing a V1 LLM engine (v0.14.0rc1.dev533+g9273a427b) with config: model='/home/dyvm6xra/dyvm6xrauser49/project/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695', speculative_config=None, tokenizer='/home/dyvm6xra/dyvm6xrauser49/project/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/home/dyvm6xra/dyvm6xrauser49/project/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 
'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [32768], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}
(EngineCore_DP0 pid=2969360) [Stage-0] WARNING 01-14 09:17:33 [multiproc_executor.py:880] Reducing Torch parallelism from 112 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
[Stage-0] WARNING 01-14 09:17:50 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
[Stage-0] WARNING 01-14 09:17:51 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'interleaved', 'mrope_section'}
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_section', 'interleaved'}
[Stage-0] INFO 01-14 09:17:53 [parallel_state.py:1214] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:43737 backend=nccl
[Stage-0] INFO 01-14 09:17:54 [parallel_state.py:1214] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:43737 backend=nccl
[Stage-0] INFO 01-14 09:17:54 [pynccl.py:111] vLLM is using nccl==2.27.5
[Stage-0] INFO 01-14 09:17:56 [parallel_state.py:1425] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
[Stage-0] INFO 01-14 09:17:56 [parallel_state.py:1425] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank 1
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_section', 'interleaved'}
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'interleaved', 'mrope_section'}
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_section', 'interleaved'}
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'interleaved', 'mrope_section'}
(Worker_TP0 pid=2970530) [Stage-0] INFO 01-14 09:17:59 [gpu_model_runner.py:3808] Starting to load model /home/dyvm6xra/dyvm6xrauser49/project/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695...
(Worker_TP1 pid=2970531) [Stage-0] INFO 01-14 09:17:59 [vllm.py:632] Asynchronous scheduling is enabled.
(Worker_TP1 pid=2970531) [Stage-0] WARNING 01-14 09:17:59 [qwen3_omni_moe_thinker.py:679] flash_attn is not available, the model may not yield the exactly same result as the transformers implementation in the audio tower part.
(Worker_TP0 pid=2970530) [Stage-0] INFO 01-14 09:18:00 [vllm.py:632] Asynchronous scheduling is enabled.
(Worker_TP0 pid=2970530) [Stage-0] WARNING 01-14 09:18:00 [qwen3_omni_moe_thinker.py:679] flash_attn is not available, the model may not yield the exactly same result as the transformers implementation in the audio tower part.
(Worker_TP1 pid=2970531) [Stage-0] INFO 01-14 09:18:00 [mm_encoder_attention.py:86] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker_TP1 pid=2970531) Warning: mrope_section check is disabled in Qwen2.5-Omni, this may cause errors, and should be restored in the future.
(Worker_TP0 pid=2970530) [Stage-0] INFO 01-14 09:18:00 [mm_encoder_attention.py:86] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker_TP0 pid=2970530) Warning: mrope_section check is disabled in Qwen2.5-Omni, this may cause errors, and should be restored in the future.
(Worker_TP0 pid=2970530) [Stage-0] INFO 01-14 09:18:00 [cuda.py:351] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
Loading safetensors checkpoint shards:   0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 5/15 [00:00<00:00, 47.50it/s]
Loading safetensors checkpoint shards:  67% Completed | 10/15 [00:00<00:00, 47.94it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 39.63it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 41.50it/s]
(Worker_TP0 pid=2970530) 
(Worker_TP0 pid=2970530) [Stage-0] INFO 01-14 09:18:08 [qwen3_omni.py:1088] Loaded 1311 weights for Qwen3OmniMoe (stage=thinker)
(Worker_TP0 pid=2970530) [Stage-0] INFO 01-14 09:18:08 [default_loader.py:291] Loading weights took 8.02 seconds
(Worker_TP1 pid=2970531) [Stage-0] INFO 01-14 09:18:09 [qwen3_omni.py:1088] Loaded 1311 weights for Qwen3OmniMoe (stage=thinker)
(Worker_TP0 pid=2970530) [Stage-0] INFO 01-14 09:18:09 [gpu_model_runner.py:3905] Model loading took 30.62 GiB memory and 9.317158 seconds
(Worker_TP1 pid=2970531) [Stage-0] INFO 01-14 09:18:10 [gpu_model_runner.py:4715] Encoder cache will be initialized with a budget of 62720 tokens, and profiled with 1 video items of the maximum feature size.
(Worker_TP0 pid=2970530) [Stage-0] INFO 01-14 09:18:10 [gpu_model_runner.py:4715] Encoder cache will be initialized with a budget of 62720 tokens, and profiled with 1 video items of the maximum feature size.
(Worker_TP0 pid=2970530) [Stage-0] INFO 01-14 09:18:29 [backends.py:644] Using cache directory: /root/.cache/vllm/torch_compile_cache/96e317a4f0/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0 pid=2970530) [Stage-0] INFO 01-14 09:18:29 [backends.py:704] Dynamo bytecode transform time: 11.25 s
(Worker_TP1 pid=2970531) [Stage-0] INFO 01-14 09:18:34 [backends.py:261] Cache the graph of compile range (1, 32768) for later use
(Worker_TP0 pid=2970530) [Stage-0] INFO 01-14 09:18:34 [backends.py:261] Cache the graph of compile range (1, 32768) for later use
(Worker_TP0 pid=2970530) [Stage-0] WARNING 01-14 09:18:34 [fused_moe.py:1090] Using default MoE config. Performance might be sub-optimal! Config file not found at /home/dyvm6xra/dyvm6xrauser49/project/server039/vllm/vllm/model_executor/layers/fused_moe/configs/E=128,N=384,device_name=NVIDIA_H800.json
(EngineCore_DP0 pid=2969360) [Stage-0] INFO 01-14 09:19:10 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker_TP0 pid=2970530) [Stage-0] INFO 01-14 09:19:23 [backends.py:278] Compiling a graph for compile range (1, 32768) takes 48.89 s
(Worker_TP0 pid=2970530) [Stage-0] INFO 01-14 09:19:23 [monitor.py:34] torch.compile takes 60.14 s in total
(Worker_TP0 pid=2970530) [Stage-0] INFO 01-14 09:19:25 [gpu_worker.py:358] Available KV cache memory: 6.01 GiB
(EngineCore_DP0 pid=2969360) [Stage-0] INFO 01-14 09:19:25 [kv_cache_utils.py:1305] GPU KV cache size: 131,344 tokens
(EngineCore_DP0 pid=2969360) [Stage-0] INFO 01-14 09:19:25 [kv_cache_utils.py:1310] Maximum concurrency for 65,536 tokens per request: 2.00x
(Worker_TP1 pid=2970531) (Worker_TP0 pid=2970530) 2026-01-14 09:19:26,024 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-14 09:19:26,024 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP1 pid=2970531) (Worker_TP0 pid=2970530) 2026-01-14 09:19:26,143 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
2026-01-14 09:19:26,143 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████| 51/51 [00:05<00:00,  8.79it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:04<00:00, 10.95it/s]
(Worker_TP0 pid=2970530) [Stage-0] INFO 01-14 09:19:37 [custom_all_reduce.py:216] Registering 9792 cuda graph addresses
(Worker_TP1 pid=2970531) [Stage-0] INFO 01-14 09:19:37 [custom_all_reduce.py:216] Registering 9792 cuda graph addresses
(Worker_TP0 pid=2970530) [Stage-0] INFO 01-14 09:19:37 [gpu_model_runner.py:4856] Graph capturing finished in 12 secs, took -1.00 GiB
(EngineCore_DP0 pid=2969360) [Stage-0] INFO 01-14 09:19:38 [core.py:273] init engine (profile, create kv cache, warmup model) took 87.91 seconds
(EngineCore_DP0 pid=2969360) Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'interleaved', 'mrope_section'}
(EngineCore_DP0 pid=2969360) [Stage-0] WARNING 01-14 09:19:38 [scheduler.py:171] Using custom scheduler class vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore_DP0 pid=2969360) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(EngineCore_DP0 pid=2969360) Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'interleaved', 'mrope_section'}
(EngineCore_DP0 pid=2969360) Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'interleaved', 'mrope_section'}
(EngineCore_DP0 pid=2969360) [Stage-0] INFO 01-14 09:19:41 [vllm.py:632] Asynchronous scheduling is enabled.
[Stage-0] INFO 01-14 09:19:41 [omni_llm.py:169] Supported_tasks: ['generate']
[Stage-0] INFO 01-14 09:19:41 [omni_stage.py:623] [Stage-0] vLLM profiler support detected (model_stage=thinker)
[Stage-0] INFO 01-14 09:19:41 [omni_stage.py:664] Max batch size: 1
INFO 01-14 09:19:41 [omni.py:295] [Orchestrator] Stage-0 reported ready
INFO 01-14 09:19:41 [omni.py:321] [Orchestrator] All stages initialized successfully
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'interleaved', 'mrope_section'}
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'interleaved', 'mrope_section'}
(Worker_TP0 pid=2970530) [Stage-0] INFO 01-14 09:19:44 [mrope.py:452] Multimodal token idx changed!
(Worker_TP1 pid=2970531) [Stage-0] INFO 01-14 09:19:44 [mrope.py:452] Multimodal token idx changed!
INFO 01-14 09:19:46 [log_utils.py:550] {'type': 'request_level_metrics',
INFO 01-14 09:19:46 [log_utils.py:550]  'request_id': '0_6758425a-d69c-4ca6-a531-65bf815e00d7',
INFO 01-14 09:19:46 [log_utils.py:550]  'e2e_time_ms': 5368.500471115112,
INFO 01-14 09:19:46 [log_utils.py:550]  'e2e_tpt': 2.3566727265650185,
INFO 01-14 09:19:46 [log_utils.py:550]  'e2e_total_tokens': 2278,
INFO 01-14 09:19:46 [log_utils.py:550]  'transfers_total_time_ms': 0.0,
INFO 01-14 09:19:46 [log_utils.py:550]  'transfers_total_bytes': 0,
INFO 01-14 09:19:46 [log_utils.py:550]  'stages': {0: {'stage_gen_time_ms': 5137.80665397644,
INFO 01-14 09:19:46 [log_utils.py:550]                 'num_tokens_out': 185,
INFO 01-14 09:19:46 [log_utils.py:550]                 'num_tokens_in': 2093}}}
Processed prompts: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00,  5.37s/req, est. speed stage-0 tok/s: 424.42, avg e2e_lat: 0.0ms]
INFO 01-14 09:19:46 [omni.py:782] [Summary] {'e2e_requests': 1,
INFO 01-14 09:19:46 [omni.py:782]  'e2e_total_time_ms': 5371.440172195435,
INFO 01-14 09:19:46 [omni.py:782]  'e2e_sum_time_ms': 5368.500471115112,
INFO 01-14 09:19:46 [omni.py:782]  'e2e_total_tokens': 2278,
INFO 01-14 09:19:46 [omni.py:782]  'e2e_avg_time_per_request_ms': 5368.500471115112,
INFO 01-14 09:19:46 [omni.py:782]  'e2e_avg_tokens_per_s': 424.3270559920111,
INFO 01-14 09:19:46 [omni.py:782]  'wall_time_ms': 5371.440172195435,
INFO 01-14 09:19:46 [omni.py:782]  'final_stage_id': {'0_6758425a-d69c-4ca6-a531-65bf815e00d7': 0},
INFO 01-14 09:19:46 [omni.py:782]  'stages': [{'stage_id': 0,
INFO 01-14 09:19:46 [omni.py:782]              'requests': 1,
INFO 01-14 09:19:46 [omni.py:782]              'tokens': 2278,
INFO 01-14 09:19:46 [omni.py:782]              'total_time_ms': 5369.201421737671,
INFO 01-14 09:19:46 [omni.py:782]              'avg_time_per_request_ms': 5369.201421737671,
INFO 01-14 09:19:46 [omni.py:782]              'avg_tokens_per_s': 424.2716599860311}],
INFO 01-14 09:19:46 [omni.py:782]  'transfers': []}
[Stage-0] INFO 01-14 09:19:46 [omni_stage.py:673] Received shutdown signal
Request ID: 0_6758425a-d69c-4ca6-a531-65bf815e00d7, Text saved to output_audio/0_6758425a-d69c-4ca6-a531-65bf815e00d7.txt
(Worker_TP1 pid=2970531) [Stage-0] INFO 01-14 09:19:54 [multiproc_executor.py:707] Parent process exited, terminating worker
(Worker_TP0 pid=2970530) [Stage-0] INFO 01-14 09:19:54 [multiproc_executor.py:707] Parent process exited, terminating worker
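
The request-level metrics in the log above appear to be related by simple arithmetic. The following is a minimal sketch, not vllm-omni code: it assumes that `e2e_total_tokens` is the sum of `num_tokens_in` and `num_tokens_out`, that `e2e_tpt` is milliseconds per token, and that `e2e_avg_tokens_per_s` is total tokens divided by end-to-end seconds. The function name `summarize_request` is hypothetical; the input values are copied from the log.

```python
def summarize_request(e2e_time_ms: float, tokens_in: int, tokens_out: int) -> dict:
    """Recompute derived metrics from the raw values in the log (illustrative only)."""
    # Assumption: the total counts both prompt and generated tokens.
    total_tokens = tokens_in + tokens_out
    return {
        "e2e_total_tokens": total_tokens,
        # Assumption: "tpt" means time per token, in milliseconds.
        "ms_per_token": e2e_time_ms / total_tokens,
        # Tokens per second over the whole request.
        "tokens_per_s": total_tokens / (e2e_time_ms / 1000.0),
    }


# Values copied from the request_level_metrics entry in the log.
summary = summarize_request(
    e2e_time_ms=5368.500471115112,
    tokens_in=2093,
    tokens_out=185,
)
print(summary)
```

Under these assumptions the recomputed values line up with the logged `e2e_total_tokens`, `e2e_tpt`, and `e2e_avg_tokens_per_s` fields, which is a quick way to sanity-check a pasted test result.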

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0810dae881

chatgpt-codex-connector · 2026-01-13T06:34:32Z

    sampling_params_list = [
        thinker_sampling_params,
-        talker_sampling_params,  # code predictor is integrated into talker for Qwen3 Omni
-        code2wav_sampling_params,
+        # talker_sampling_params,  # code predictor is integrated into talker for Qwen3 Omni
+        # code2wav_sampling_params,


Provide per-stage sampling params to match 3-stage pipeline

With only thinker_sampling_params in sampling_params_list, the default Qwen3-Omni Instruct pipeline (three stages in vllm_omni/model_executor/stage_configs/qwen3_omni_moe.yaml) will raise a ValueError because Omni._run_generation requires len(sampling_params_list) == len(self.stage_list) (vllm_omni/entrypoints/omni.py). This means running the example with the default stage config now fails before any generation occurs; it only works if users manually supply a single-stage config (e.g., thinking-only), which isn’t the default for this model.
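
The failure mode described in this review comment can be illustrated with a small sketch. This is not the code in `vllm_omni/entrypoints/omni.py`; `check_stage_params` is a hypothetical stand-in for the length check the comment attributes to `Omni._run_generation`, and the stage and parameter names are placeholder strings.

```python
def check_stage_params(sampling_params_list, stage_list):
    """Hypothetical stand-in: require one sampling-params entry per stage."""
    if len(sampling_params_list) != len(stage_list):
        raise ValueError(
            f"Expected {len(stage_list)} sampling params (one per stage), "
            f"got {len(sampling_params_list)}"
        )
    # Pair each stage with its sampling params.
    return list(zip(stage_list, sampling_params_list))


# Placeholder names for a three-stage pipeline, as in the review comment.
stages = ["thinker", "talker", "code2wav"]

# One entry per stage passes the check.
pairs = check_stage_params(["thinker_sp", "talker_sp", "code2wav_sp"], stages)

# A single entry against three stages is rejected before any generation.
try:
    check_stage_params(["thinker_sp"], stages)
except ValueError as exc:
    print(f"rejected: {exc}")
```

With a check of this shape, commenting out two of the three entries only works when the stage config is also reduced to a single stage.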


Gaohan123

LGTM, this is a huge amount of work. Thanks!

JerryKwan · 2026-01-19T06:59:53Z

I tried to set up vllm-omni with GLM-Image support, but encountered a lot of errors using vLLM v0.14.0 and the latest vLLM-Omni (commit: 5e7035e and dev/rebase_0.14.0).
Any suggestions on which versions I should use?

* init and registry Signed-off-by: JaredforReal <w13431838023@gmail.com>
* implement glm_image_transformer.py Signed-off-by: JaredforReal <w13431838023@gmail.com>
* update transformer Signed-off-by: JaredforReal <w13431838023@gmail.com>
* init pipeline_glm_image.py Signed-off-by: JaredforReal <w13431838023@gmail.com>
* init pipeline_glm_image.py Signed-off-by: JaredforReal <w13431838023@gmail.com>
* remove pre process Signed-off-by: JaredforReal <w13431838023@gmail.com>
* add check_input(), implement CFG parallel in diffuse(), align generate_prior_tokens Signed-off-by: JaredforReal <w13431838023@gmail.com>
* fix check_input(prompt_embed), add KVCache for Image Edit Signed-off-by: JaredforReal <w13431838023@gmail.com>
* print out vllm version Signed-off-by: root <root@hk01dgx039.cm.cluster>
* update model config Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update worker Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update one import in AsyncOmniLLM (not finish all, but can run) Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update Qwen3 Omni ViT init based on updated interface (the update for Qwen3 Omni Thinker is not finished) Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* Remove unnecessary override for OmniRequestState (the update for OmniRequestState is not finished) Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update model runner dummy run Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update ar scheduler Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update _preprocess, execute model and sample_tokens for AR Model Runner Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* debug AR Scheduler Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update OmniGPUModelRunner._update_states Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update the offline LLM request sorting due to changed requested id format Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update Qwen3 Omni to fit with the engine core logic Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update generation model runner Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* debug GLM-Image Model Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* remove deleted args from doc string Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* [Model][Rebase] Add GLM-Image Model and Partial Rebase to v0.14.0 (Support AR Offiline) (vllm-project#763) Signed-off-by: JaredforReal <w13431838023@gmail.com> Signed-off-by: root <root@hk01dgx039.cm.cluster> Signed-off-by: tzhouam <tzhouam@connect.ust.hk> Co-authored-by: JaredforReal <w13431838023@gmail.com> Co-authored-by: root <root@hk01dgx039.cm.cluster>
* disable async scheduling for generation models, avoiding inconsistency from race condition Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* Update Qwen 3 Omni Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* [Fix] GLM Image (vllm-project#799) Signed-off-by: JaredforReal <w13431838023@gmail.com>
* support online serving for Qwen3 Omni Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* fix pre-commit Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* inherit engine outputs Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* supporting audio in video(not finished) Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* Update Qwen2.5 Omni model to version 0.14, adding support for image and video input processing, and refining position handling for MRoPE. Adjustments made to the YAML configuration to disable async scheduling for consistency. Code cleanup and formatting improvements included. Signed-off-by: Taichang Zhou <tzhouam@connect.ust.hk>
* debug qwen 2.5 Omni Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update doc Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* rebase to vllm 0.14.0 Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* unify query type Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* fix build doc Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* Dev/rebase 0.14.0 (vllm-project#813)
  Signed-off-by: JaredforReal <w13431838023@gmail.com>
  Signed-off-by: root <root@hk01dgx039.cm.cluster>
  Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
  Signed-off-by: TangPeng <85704592@qq.com>
  Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
  Signed-off-by: mxuax <mxuax@connect.ust.hk>
  Signed-off-by: Sihyeon Jang <sihyeon.jang@navercorp.com>
  Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
  Signed-off-by: iwzbi <wzbi@zju.edu.cn>
  Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>
  Signed-off-by: Yuhan Liu <30294295+liuyuhanalex@users.noreply.github.com>
  Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>
  Signed-off-by: samithuang <285365963@qq.com>
  Signed-off-by: linyueqian <linyueqian@outlook.com>
  Signed-off-by: David Chen <530634352@qq.com>
  Signed-off-by: yinpeiqi <yinpeiqi809@gmail.com>
  Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
  Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
  Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
  Signed-off-by: princepride <wangzhipeng628@gmail.com>
  Signed-off-by: Taichang Zhou <tzhouam@connect.ust.hk>
  Signed-off-by: John Liu BUAA <liukecheng97@gmail.com>
  Signed-off-by: Dinesh G <G.Dinesh@ibm.com>
  Signed-off-by: gDINESH13 <dinesh13g@gmail.com>
  Co-authored-by: JaredforReal <w13431838023@gmail.com>
  Co-authored-by: root <root@hk01dgx039.cm.cluster>
  Co-authored-by: JustQJ <37905360+JustQJ@users.noreply.github.com>
  Co-authored-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
  Co-authored-by: Sihyeon Jang <uneedsihyeon@gmail.com>
  Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
  Co-authored-by: catcat <108673086+iwzbi@users.noreply.github.com>
  Co-authored-by: Ziming Huang <hzm414167@alibaba-inc.com>
  Co-authored-by: Yuhan Liu <30294295+liuyuhanalex@users.noreply.github.com>
  Co-authored-by: Alicia <115451386+congw729@users.noreply.github.com>
  Co-authored-by: Samit <285365963@qq.com>
  Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
  Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
  Co-authored-by: Peiqi Yin <60515999+yinpeiqi@users.noreply.github.com>
  Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
  Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
  Co-authored-by: wangyu <53896905+yenuo26@users.noreply.github.com>
  Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
  Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
  Co-authored-by: John Liu BUAA <liukecheng97@gmail.com>
  Co-authored-by: D!NE$H <67671800+gDINESH13@users.noreply.github.com>
* update test import Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update version from 0.14.0rc2 to 0.14.0 Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* set vllm config for all CI Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update CI Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* Fix CPU offload OOM and performance issues in GLM-Image pipeline
* Fix CPU offload OOM and performance issues in GLM-Image pipeline
  - Conditionally load vision_language_encoder, text_encoder, and vae to GPU only when CPU offload is disabled
  - Propagate cpu_offload_gb argument to enable_cpu_offload flag
  - Include vision_language_encoder in CPU offload hooks for proper AR model offloading
  - Fix device mismatch in generate_prior_tokens during CPU offload mode
* Fix shared memory broadcast hang in GLM-Image pipeline
  - Add manual encoder activation support to SequentialOffloader
  - Explicitly trigger vision_language_encoder onload before get_image_features in pipeline
  - Prevents CPU-bound stalling during AR generation when offload is active
* Fix device mismatch in generate() by triggering offload hook
* Clean up temporary patch files

---------

Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: root <root@hk01dgx039.cm.cluster>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: Taichang Zhou <tzhouam@connect.ust.hk>
Signed-off-by: TangPeng <85704592@qq.com>
Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Signed-off-by: mxuax <mxuax@connect.ust.hk>
Signed-off-by: Sihyeon Jang <sihyeon.jang@navercorp.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: iwzbi <wzbi@zju.edu.cn>
Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>
Signed-off-by: Yuhan Liu <30294295+liuyuhanalex@users.noreply.github.com>
Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>
Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: yinpeiqi <yinpeiqi809@gmail.com>
Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: John Liu BUAA <liukecheng97@gmail.com>
Signed-off-by: Dinesh G <G.Dinesh@ibm.com>
Signed-off-by: gDINESH13 <dinesh13g@gmail.com>
Co-authored-by: JaredforReal <w13431838023@gmail.com>
Co-authored-by: root <root@hk01dgx039.cm.cluster>
Co-authored-by: tzhouam <tzhouam@connect.ust.hk>
Co-authored-by: JustQJ <37905360+JustQJ@users.noreply.github.com>
Co-authored-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Co-authored-by: Sihyeon Jang <uneedsihyeon@gmail.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: catcat <108673086+iwzbi@users.noreply.github.com>
Co-authored-by: Ziming Huang <hzm414167@alibaba-inc.com>
Co-authored-by: Yuhan Liu <30294295+liuyuhanalex@users.noreply.github.com>
Co-authored-by: Alicia <115451386+congw729@users.noreply.github.com>
Co-authored-by: Samit <285365963@qq.com>
Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
Co-authored-by: Peiqi Yin <60515999+yinpeiqi@users.noreply.github.com>
Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: John Liu BUAA <liukecheng97@gmail.com>
Co-authored-by: D!NE$H <67671800+gDINESH13@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>

RocketRider · 2026-01-24T11:24:37Z

I tried it with 0.14.0rc1, but it is not starting:

WARNING 01-24 02:19:12 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
WARNING 01-24 02:19:13 [envs.py:194] Flash Attention library "flash_attn" not found, using pytorch attention implementation
[Stage-0] WARNING 01-24 02:19:17 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
[Stage-0] WARNING 01-24 02:19:18 [envs.py:194] Flash Attention library "flash_attn" not found, using pytorch attention implementation
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm_omni/entrypoints/omni_stage.py", line 1021, in _stage_worker_async_entry
    asyncio.run(_stage_worker_async(omni_stage, model, stage_payload, batch_timeout, stage_init_timeout))
  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 691, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm_omni/entrypoints/omni_stage.py", line 1221, in _stage_worker_async
    stage_engine = AsyncOmniDiffusion(
                   ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm_omni/entrypoints/async_omni_diffusion.py", line 103, in __init__
    self.engine: DiffusionEngine = DiffusionEngine.make_engine(od_config)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm_omni/diffusion/diffusion_engine.py", line 164, in make_engine
    return DiffusionEngine(config)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm_omni/diffusion/diffusion_engine.py", line 43, in __init__
    self.post_process_func = get_diffusion_post_process_func(od_config)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm_omni/diffusion/registry.py", line 250, in get_diffusion_post_process_func
    return _load_process_func(od_config, func_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm_omni/diffusion/registry.py", line 241, in _load_process_func
    module = importlib.import_module(module_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1310, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/usr/local/lib/python3.12/dist-packages/vllm_omni/diffusion/models/glm_image/__init__.py", line 5, in <module>
    from vllm_omni.diffusion.models.glm_image.glm_image_transformer import (
  File "/usr/local/lib/python3.12/dist-packages/vllm_omni/diffusion/models/glm_image/glm_image_transformer.py", line 12, in <module>
    from diffusers.models.transformers.transformer_glm_image import GlmImageCombinedTimestepSizeEmbeddings
ModuleNotFoundError: No module named 'diffusers.models.transformers.transformer_glm_image'
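
The traceback ends with the installed `diffusers` package lacking the `transformer_glm_image` module, which is consistent with the PR description installing `diffusers` from its git repository rather than a release. A minimal sketch for checking this before launching, assuming only the module path shown in the traceback; the helper name `module_available` is hypothetical and not part of vllm-omni or diffusers:

```python
import importlib.util


def module_available(dotted_name: str) -> bool:
    """Return True if the module can be located without importing it fully."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except (ImportError, ValueError):
        # find_spec raises instead of returning None when a parent package
        # is missing (ModuleNotFoundError is a subclass of ImportError).
        return False


# Module path copied from the traceback above.
GLM_IMAGE_MODULE = "diffusers.models.transformers.transformer_glm_image"

if module_available(GLM_IMAGE_MODULE):
    print("diffusers provides the GLM-Image transformer module")
else:
    print("GLM-Image transformer module not found in the installed diffusers")
```

If the check reports the module as missing, the installed `diffusers` build likely predates GLM-Image support.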

hsliuustc0106 · 2026-01-24T14:25:42Z

I tried to set up vllm-omni with GLM-Image support, but encountered a lot of errors using vLLM v0.14.0 and the latest vLLM-Omni (commit: 5e7035e and dev/rebase_0.14.0). Any suggestions on which versions I should use?

Please check #920 for details; some work is still ongoing.

JaredforReal and others added 9 commits January 8, 2026 17:55

init and registry

3059e27

Signed-off-by: JaredforReal <w13431838023@gmail.com>

implement glm_image_transformer.py

c0a7684

Signed-off-by: JaredforReal <w13431838023@gmail.com>

update transformer

800cea4

Signed-off-by: JaredforReal <w13431838023@gmail.com>

init pipeline_glm_image.py

8664695

Signed-off-by: JaredforReal <w13431838023@gmail.com>

init pipeline_glm_image.py

b88b4b2

Signed-off-by: JaredforReal <w13431838023@gmail.com>

remove pre process

b9108f4

Signed-off-by: JaredforReal <w13431838023@gmail.com>

add check_input(), implement CFG parallel in diffuse(), align generate_prior_tokens

371afd5

Signed-off-by: JaredforReal <w13431838023@gmail.com>

fix check_input(prompt_embed), add KVCache for Image Edit

3d4f5f2

Signed-off-by: JaredforReal <w13431838023@gmail.com>

print out vllm version

0810dae

Signed-off-by: root <root@hk01dgx039.cm.cluster>

tzhouam requested a review from hsliuustc0106 as a code owner January 13, 2026 06:31

update model config

8e36c51

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

chatgpt-codex-connector Bot reviewed Jan 13, 2026

tzhouam changed the title ~~[Debug] Print vllm version in end2end~~ [Rebase] Rebase to v0.14.0 Jan 13, 2026

tzhouam added 9 commits January 13, 2026 07:23

update worker

7f704d5

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

update one import in AsyncOmniLLM (not finish all, but can run)

4afb2ff

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

update Qwen3 Omni ViT init based on updated interface (the update for Qwen3 Omni Thinker is not finished)

cb2e053

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

Remove unnecessary override for OmniRequestState (the update for OmniRequestState is not finished)

e052c4a

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

update model runner dummy run

c08dcdd

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

update ar scheduler

166fc78

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

update _preprocess, execute model and sample_tokens for AR Model Runner

4db8f0b

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

debug AR Scheduler

63a69a5

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

update OmniGPUModelRunner._update_states

5bcdb43

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

david6666666 mentioned this pull request Jan 13, 2026

[RFC]: vLLM-Omni 2026 Q1 Roadmap #677

Open

38 tasks

update the offline LLM request sorting due to changed requested id format

2a0f72f

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

david6666666 mentioned this pull request Jan 14, 2026

[Feature]: [Rebase] Rebase to v0.14.0 JiusiServe/vllm-omni#67

Closed

2 tasks

david6666666 added this to the v0.14.0rc1 milestone Jan 14, 2026

update Qwen3 Omni to fit with the engine core logic

f7c8af9

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

tzhouam changed the title ~~[Rebase] Rebase to v0.14.0~~ [Rebase] Partial Rebase to v0.14.0 (Support AR Offiline) Jan 14, 2026

tzhouam added 2 commits January 14, 2026 08:27

Merge PR vllm-project#724

f12e0af

update generation model runner

e2462d2

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

tzhouam changed the title ~~[Rebase] Partial Rebase to v0.14.0 (Support AR Offiline)~~ [Rebase] Partial Rebase to v0.14.0 (Support AR Offiline) and add GLM-Image Jan 14, 2026

debug GLM-Image Model

d89e3c4

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

tzhouam changed the title ~~[Rebase] Partial Rebase to v0.14.0 (Support AR Offiline) and add GLM-Image~~ [Model][Rebase] Add GLM-Image Model and Partial Rebase to v0.14.0 (Support AR Offiline) Jan 14, 2026

remove deleted args from doc string

f269e0e

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

Gaohan123 approved these changes Jan 14, 2026

Gaohan123 merged commit 12b69f4 into vllm-project:dev/rebase_0.14.0 Jan 14, 2026
2 checks passed

JaredforReal mentioned this pull request Jan 15, 2026

[Model]GLM Image #724

Closed

5 tasks

david6666666 mentioned this pull request Jan 16, 2026

vLLM-Omni Model Support #808

Open

63 tasks

lishunyang12 mentioned this pull request Feb 5, 2026

[Bug]: ModuleNotFoundError for model-specific pipelines (LongCat-Image / GLM-Image) in diffusers namespace #1215

Closed

wtomin mentioned this pull request Mar 4, 2026

[RFC]: Continuous Diffusion Model Acceleration Support #1217

Open

1 task

herotai214 mentioned this pull request May 6, 2026

[Bug]: GLM-Image CFG Parallel is not working #3382

Open

1 task

Gaohan123 merged 25 commits into vllm-project:dev/rebase_0.14.0 from tzhouam:dev/rebase-0.14.0

7 participants
