
[Fix] Ensure HuggingFace downloads complete before initialization. #1213

Merged
Gaohan123 merged 22 commits into vllm-project:main from zzhuoxin1508:fix/load-before-init
Feb 10, 2026

Conversation

@zzhuoxin1508
Contributor

Purpose

This PR improves the startup stability of multimodal models in multi-stage pipelines. The Orchestrator now completes all critical file downloads before spawning any Stage Workers, which eliminates concurrent download conflicts and initialization timeouts in multi-process environments.

Solution

  • Enabled recursive mode (**/*.ext) in omni_snapshot_download. This forces the Orchestrator to fully pull and verify all model files (including subdirectories) before initializing child processes.
  • Implemented require_all logic within download_weights_from_hf_specific. When enabled, the downloader ensures every defined matching pattern is accurately validated and successfully downloaded.
  • Refactored the omni_snapshot_download logic to prioritize local path validation.
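
A minimal sketch of how a require_all switch could behave, assuming the downloader otherwise narrows the pattern list to the first pattern that matches a file in the repository. Only the require_all name comes from the description above; select_patterns and the way it is wired are hypothetical and not taken from this PR's diff.

```python
import fnmatch


def select_patterns(repo_files, allow_patterns, require_all=False):
    """Return the patterns a downloader would request (hypothetical helper)."""
    if require_all:
        # Keep every pattern so configs, tokenizers, and weights are all fetched.
        return list(allow_patterns)
    # Assumed default: stop at the first pattern that matches any repo file.
    for pattern in allow_patterns:
        if fnmatch.filter(repo_files, pattern):
            return [pattern]
    return list(allow_patterns)


# File names modeled on the download log in the Test Result section.
repo_files = [
    "model_index.json",
    "text_encoder/model-00001-of-00003.safetensors",
    "vae/diffusion_pytorch_model.safetensors",
]
patterns = ["*.json", "*.safetensors"]

first_only = select_patterns(repo_files, patterns)
everything = select_patterns(repo_files, patterns, require_all=True)
```

Under that assumption, the default call would request only the JSON files, while require_all=True keeps the full list so the weights are present before any Stage Worker starts.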

Test Plan

Validated the fix using the Tongyi-MAI/Z-Image-Turbo model.

python text_to_image.py \
  --model Tongyi-MAI/Z-Image-Turbo \
  --prompt "a cup of coffee on the table" \
  --seed 42 \
  --cfg_scale 4.0 \
  --num_images_per_prompt 1 \
  --num_inference_steps 50 \
  --height 1024 \
  --width 1024 \
  --output outputs/coffee.png

Test Result

(vllm-omni-env) root@91f3d7b48993:/workspace/vllm-omni/examples/offline_inference/text_to_image# python text_to_image.py \
  --model Tongyi-MAI/Z-Image-Turbo \
  --prompt "a cup of coffee on the table" \
  --seed 42 \
  --cfg_scale 4.0 \
  --num_images_per_prompt 1 \
  --num_inference_steps 50 \
  --height 1024 \
  --width 1024 \
  --output outputs/coffee.png
INFO 02-05 02:12:36 [weight_utils.py:49] Using model weights format ['*.json', '*.bin', '*.safetensors', '*.pt', '*.txt', '*.model', '*.yaml']
model_index.json: 100%|████████████████████████████████████████████████████████████████████████████████████| 467/467 [00:00<00:00, 4.79MB/s]
scheduler_config.json: 100%|███████████████████████████████████████████████████████████████████████████████| 173/173 [00:00<00:00, 1.37MB/s]
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 726/726 [00:00<00:00, 9.06MB/s]
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████| 239/239 [00:00<00:00, 1.88MB/s]
model.safetensors.index.json: 32.8kB [00:00, 77.9MB/s]
tokenizer/tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:00<00:00, 25.9MB/s]
tokenizer_config.json: 9.73kB [00:00, 28.8MB/s]
vocab.json: 2.78MB [00:00, 95.6MB/s]
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 473/473 [00:00<00:00, 3.35MB/s]
(…)ion_pytorch_model.safetensors.index.json: 49.0kB [00:00, 130MB/s]
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 805/805 [00:00<00:00, 10.3MB/s]
text_encoder/model-00001-of-00003.safete(…): 100%|██████████████████████████████████████████████████████| 3.96G/3.96G [00:30<00:00, 131MB/s]
text_encoder/model-00002-of-00003.safete(…): 100%|██████████████████████████████████████████████████████| 3.99G/3.99G [00:30<00:00, 132MB/s]
text_encoder/model-00003-of-00003.safete(…): 100%|██████████████████████████████████████████████████████| 99.6M/99.6M [00:00<00:00, 123MB/s]
transformer/diffusion_pytorch_model-0000(…): 100%|██████████████████████████████████████████████████████| 9.97G/9.97G [00:39<00:00, 252MB/s]
transformer/diffusion_pytorch_model-0000(…): 100%|██████████████████████████████████████████████████████| 9.97G/9.97G [00:40<00:00, 247MB/s]
transformer/diffusion_pytorch_model-0000(…): 100%|██████████████████████████████████████████████████████| 4.67G/4.67G [00:19<00:00, 243MB/s]
vae/diffusion_pytorch_model.safetensors: 100%|███████████████████████████████████████████████████████████| 168M/168M [00:01<00:00, 98.0MB/s]
merges.txt: 1.67MB [00:00, 75.0MB/s]
INFO 02-05 02:15:22 [weight_utils.py:70] Time spent downloading weights for Tongyi-MAI/Z-Image-Turbo: 165.555125 seconds
INFO 02-05 02:15:22 [omni.py:137] Initializing stages for model: /workspace/.cache/huggingface/hub/models--Tongyi-MAI--Z-Image-Turbo/snapshots/f332072aa78be7aecdf3ee76d5c247082da564a6
INFO 02-05 02:15:22 [initialization.py:35] No OmniTransferConfig provided
INFO 02-05 02:15:22 [omni_stage.py:100] [OmniStage] stage_config: {'stage_id': 0, 'stage_type': 'diffusion', 'runtime': {'process': True, 'devices': '0', 'max_batch_size': 1}, 'engine_args': {'enable_layerwise_offload': False, 'layerwise_num_gpu_layers': 1, 'vae_use_slicing': False, 'vae_use_tiling': False, 'cache_backend': None, 'cache_config': None, 'enable_cache_dit_summary': False, 'parallel_config': {'pipeline_parallel_size': 1, 'data_parallel_size': 1, 'tensor_parallel_size': 1, 'sequence_parallel_size': 1, 'ulysses_degree': 1, 'ring_degree': 1, 'cfg_parallel_size': 1}, 'enforce_eager': False, 'enable_cpu_offload': False, 'model': '/workspace/.cache/huggingface/hub/models--Tongyi-MAI--Z-Image-Turbo/snapshots/f332072aa78be7aecdf3ee76d5c247082da564a6', 'model_stage': 'diffusion'}, 'final_output': True, 'final_output_type': 'image'}
INFO 02-05 02:15:22 [omni.py:356] [Orchestrator] Waiting for 1 stages to initialize (timeout: 300s)
[Stage-0] INFO 02-05 02:15:31 [omni_stage.py:497] Starting stage worker with model: /workspace/.cache/huggingface/hub/models--Tongyi-MAI--Z-Image-Turbo/snapshots/f332072aa78be7aecdf3ee76d5c247082da564a6
[Stage-0] INFO 02-05 02:15:31 [omni_stage.py:510] [Stage] Set VLLM_WORKER_MULTIPROC_METHOD=spawn
[Stage-0] INFO 02-05 02:15:52 [multiproc_executor.py:74] Starting server...
[Stage-0] INFO 02-05 02:16:02 [diffusion_worker.py:269] Worker 0 created result MessageQueue
[Stage-0] INFO 02-05 02:16:02 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=2048.
[Stage-0] INFO 02-05 02:16:02 [vllm.py:624] Asynchronous scheduling is enabled.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Stage-0] INFO 02-05 02:16:02 [diffusion_worker.py:95] Worker 0: Initialized device and distributed environment.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Stage-0] INFO 02-05 02:16:02 [parallel_state.py:565] Building SP subgroups from explicit sp_group_ranks (sp_size=1, ulysses=1, ring=1, use_ulysses_low=True).
[Stage-0] INFO 02-05 02:16:02 [parallel_state.py:607] SP group details for rank 0: sp_group=[0], ulysses_group=[0], ring_group=[0]
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  1.74it/s]
[Stage-0] INFO 02-05 02:16:05 [z_image_transformer.py:619] Z-Image init: dim=3840 n_heads=30 n_kv_heads=30 ffn_hidden_dim=10240 final_out_dims=(64,) tp=1 (supported_tp=(1, 2))
[Stage-0] INFO 02-05 02:16:05 [platform.py:77] Defaulting to diffusion attention backend FLASH_ATTN
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:04<00:09,  4.79s/it]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:08<00:04,  4.40s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:10<00:00,  3.27s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:10<00:00,  3.62s/it]

[Stage-0] INFO 02-05 02:16:17 [diffusers_loader.py:227] Loading weights took 11.11 seconds
[Stage-0] INFO 02-05 02:16:17 [diffusion_model_runner.py:103] Model loading took 19.1516 GiB and 14.864855 seconds
[Stage-0] INFO 02-05 02:16:17 [diffusion_model_runner.py:108] Model runner: Model loaded successfully.
[Stage-0] INFO 02-05 02:16:17 [diffusion_model_runner.py:122] Model runner: Model compiled with torch.compile.
[Stage-0] INFO 02-05 02:16:17 [diffusion_model_runner.py:137] Model runner: Initialization complete.
[Stage-0] INFO 02-05 02:16:17 [manager.py:90] Initializing DiffusionLoRAManager: device=cuda:0, dtype=torch.bfloat16, max_cached_adapters=1, static_lora_path=None
[Stage-0] INFO 02-05 02:16:17 [diffusion_worker.py:126] Worker 0: Initialization complete.
[Stage-0] INFO 02-05 02:16:17 [diffusion_worker.py:393] Worker 0: Scheduler loop started.
[Stage-0] INFO 02-05 02:16:17 [diffusion_worker.py:320] Worker 0 ready to receive requests via shared memory
[Stage-0] INFO 02-05 02:16:17 [scheduler.py:38] SyncScheduler initialized result MessageQueue
[Stage-0] INFO 02-05 02:16:17 [diffusion_engine.py:337] dummy run to warm up the model
[Stage-0] INFO 02-05 02:16:17 [manager.py:538] Deactivating all adapters: 0 layers
[Stage-0] WARNING 02-05 02:16:17 [kv_transfer_manager.py:452] Request has no ID, cannot receive KV cache
[Stage-0] INFO 02-05 02:16:31 [omni_stage.py:740] Max batch size: 1
INFO 02-05 02:16:31 [omni.py:349] [Orchestrator] Stage-0 reported ready
INFO 02-05 02:16:31 [omni.py:375] [Orchestrator] All stages initialized successfully

Essential Elements of an Effective PR Description Checklist
  • [ x ] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [ x ] The test plan, such as providing test command.
  • [ x ] The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e1edaa8168


Comment thread vllm_omni/entrypoints/omni.py Outdated
Comment on lines +91 to +93
allow_patterns=[
"**/*.json", "**/*.bin", "**/*.safetensors", "**/*.pt",
"**/*.txt", "**/*.model", "**/*.yaml"

P2: Include legacy .npy assets in HF prefetch

The new prefetch list omits *.npy, but the Qwen2.5 Omni loader explicitly supports legacy speaker assets stored under inputs/*spk_emb.npy and inputs/*ref_mel.npy in the model directory (see _init_token2wav_model in vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni.py). Because omni_snapshot_download now converts repo IDs into a local snapshot path, downstream loaders will treat it as a local directory and won’t fall back to Hugging Face to fetch missing files. For models that only ship the legacy .npy assets (no spk_dict.pt), this change silently drops conditioning data and forces the fallback zeros path, which breaks speaker conditioning quality. Consider adding **/*.npy (or using * for the prefetch) to avoid losing these files.
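
The gap described in this comment can be checked with a short sketch. It assumes the allow-pattern filter behaves like Python's fnmatch; the file names and the skipped_files helper are hypothetical, chosen only to mirror the inputs/*spk_emb.npy and inputs/*ref_mel.npy layout mentioned above.

```python
import fnmatch

# Patterns quoted in the review thread above.
ALLOW_PATTERNS = [
    "**/*.json", "**/*.bin", "**/*.safetensors", "**/*.pt",
    "**/*.txt", "**/*.model", "**/*.yaml",
]


def skipped_files(repo_files, allow_patterns):
    """List files that match none of the allow patterns."""
    return [
        name for name in repo_files
        if not any(fnmatch.fnmatchcase(name, p) for p in allow_patterns)
    ]


# Hypothetical repository layout with legacy speaker assets.
repo_files = [
    "inputs/demo_spk_emb.npy",
    "inputs/demo_ref_mel.npy",
    "thinker/config.json",
]

missing = skipped_files(repo_files, ALLOW_PATTERNS)
fixed = skipped_files(repo_files, ALLOW_PATTERNS + ["**/*.npy"])
```

With the original list both .npy files are skipped; adding "**/*.npy" leaves nothing behind in this toy layout.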


@hsliuustc0106 hsliuustc0106 requested a review from Copilot February 5, 2026 04:33
@hsliuustc0106
Collaborator

fix precommits please

Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>
@hsliuustc0106
Collaborator

could you please also test the qwen2.5-omni model?

Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>
@zzhuoxin1508
Contributor Author

could you please also test the qwen2.5-omni model?

ok i'll try it

Contributor

Copilot AI left a comment


Pull request overview

This pull request addresses initialization issues in multimodal models within multi-stage pipelines by ensuring that the Orchestrator completes all critical file downloads before spawning Stage Workers. This eliminates concurrent download conflicts and initialization timeouts in multi-process environments.

Changes:

  • Added require_all parameter to download_weights_from_hf_specific to force downloading all matching patterns
  • Refactored omni_snapshot_download to use recursive glob patterns (**/*.ext) and the new require_all functionality
  • Added local path validation to omni_snapshot_download for optimization
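
The local-path-first behaviour in the last bullet might look roughly like the sketch below. resolve_model_path and fake_download are hypothetical names used for illustration; the real omni_snapshot_download may be structured differently.

```python
import os
import tempfile


def resolve_model_path(model, download_fn):
    """Return a local directory for `model` (hypothetical helper)."""
    if os.path.isdir(model):
        # Already on disk: skip the network entirely.
        return model
    # Otherwise treat `model` as a repo ID and fetch it once, up front.
    return download_fn(model)


calls = []


def fake_download(repo_id):
    # Stand-in for the real downloader: records the call, returns a fake path.
    calls.append(repo_id)
    return os.path.join(tempfile.gettempdir(), repo_id.replace("/", "--"))


local_dir = tempfile.mkdtemp()
from_local = resolve_model_path(local_dir, fake_download)
from_hub = resolve_model_path("Tongyi-MAI/Z-Image-Turbo", fake_download)
```

Resolving the path once in the parent process means every Stage Worker receives a local directory and has no reason to start its own download.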

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File: Description
  • vllm_omni/model_executor/model_loader/weight_utils.py: Added require_all parameter to control whether all patterns should be downloaded
  • vllm_omni/entrypoints/omni.py: Refactored snapshot download to use recursive patterns and ensure complete downloads


Comment thread vllm_omni/model_executor/model_loader/weight_utils.py Outdated
Comment thread vllm_omni/model_executor/model_loader/weight_utils.py Outdated
Comment thread vllm_omni/entrypoints/omni.py
Comment thread vllm_omni/entrypoints/omni.py
@zzhuoxin1508
Contributor Author

qwen2.5-omni test

@hsliuustc0106 @lishunyang12 @nussejzz PTAL

(workspace) root@72e628fb7449:/workspace/vllm-omni/examples/offline_inference/qwen2_5_omni# bash run_single_prompt.sh
INFO 02-05 05:55:11 [weight_utils.py:49] Using model weights format ['*.json', '*.bin', '*.safetensors', '*.pt', '*.txt', '*.model', '*.yaml']
added_tokens.json: 100%|███████████████████████████████████████████████████████████████████████████████████| 579/579 [00:00<00:00, 5.77MB/s]
chat_template.json: 1.31kB [00:00, 4.78MB/s]
config.json: 13.2kB [00:00, 64.8MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████| 74.0/74.0 [00:00<00:00, 910kB/s]
model.safetensors.index.json: 233kB [00:00, 326MB/s]
preprocessor_config.json: 100%|████████████████████████████████████████████████████████████████████████████| 667/667 [00:00<00:00, 4.76MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████| 832/832 [00:00<00:00, 5.19MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:00<00:00, 22.7MB/s]
tokenizer_config.json: 6.47kB [00:00, 22.5MB/s]
vocab.json: 2.78MB [00:00, 85.8MB/s]
model-00001-of-00005.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.99G/4.99G [00:38<00:00, 129MB/s]
model-00002-of-00005.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.99G/4.99G [00:38<00:00, 128MB/s]
model-00003-of-00005.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.99G/4.99G [00:38<00:00, 131MB/s]
model-00004-of-00005.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.97G/4.97G [00:37<00:00, 132MB/s]
model-00005-of-00005.safetensors: 100%|█████████████████████████████████████████████████████████████████| 2.43G/2.43G [00:12<00:00, 196MB/s]
spk_dict.pt: 100%|████████████████████████████████████████████████████████████████████████████████████████| 260k/260k [00:00<00:00, 925kB/s]
merges.txt: 1.67MB [00:00, 141MB/s]
INFO 02-05 05:58:00 [weight_utils.py:70] Time spent downloading weights for Qwen/Qwen2.5-Omni-7B: 169.257966 seconds
INFO 02-05 05:58:00 [omni.py:137] Initializing stages for model: /workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
INFO 02-05 05:58:00 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('0', '1')
INFO 02-05 05:58:00 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('1', '2')
INFO 02-05 05:58:00 [initialization.py:234] Loaded OmniTransferConfig with 2 connector configurations
INFO 02-05 05:58:00 [factory.py:46] Created connector: SharedMemoryConnector
INFO 02-05 05:58:00 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
INFO 02-05 05:58:00 [factory.py:46] Created connector: SharedMemoryConnector
INFO 02-05 05:58:00 [initialization.py:60] Created connector for 1 -> 2: SharedMemoryConnector
INFO 02-05 05:58:00 [omni_stage.py:100] [OmniStage] stage_config: {'stage_id': 0, 'stage_type': 'llm', 'runtime': {'process': True, 'devices': '0', 'max_batch_size': 1}, 'engine_args': {'model_stage': 'thinker', 'model_arch': 'Qwen2_5OmniForConditionalGeneration', 'worker_type': 'ar', 'scheduler_cls': 'vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler', 'gpu_memory_utilization': 0.8, 'enforce_eager': True, 'trust_remote_code': True, 'engine_output_type': 'latent', 'enable_prefix_caching': False, 'max_num_batched_tokens': 32768, 'max_num_seqs': 1, 'async_chunk': False}, 'is_comprehension': True, 'final_output': True, 'final_output_type': 'text', 'default_sampling_params': {'temperature': 0.0, 'top_p': 1.0, 'top_k': -1, 'max_tokens': 2048, 'seed': 42, 'detokenize': True, 'repetition_penalty': 1.1}}
INFO 02-05 05:58:00 [omni_stage.py:100] [OmniStage] stage_config: {'stage_id': 1, 'stage_type': 'llm', 'runtime': {'process': True, 'devices': '1', 'max_batch_size': 1}, 'engine_args': {'model_stage': 'talker', 'model_arch': 'Qwen2_5OmniForConditionalGeneration', 'worker_type': 'ar', 'scheduler_cls': 'vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler', 'gpu_memory_utilization': 0.8, 'enforce_eager': True, 'trust_remote_code': True, 'enable_prefix_caching': False, 'max_num_batched_tokens': 32768, 'engine_output_type': 'latent', 'max_num_seqs': 1, 'async_chunk': False}, 'engine_input_source': [0], 'custom_process_input_func': 'vllm_omni.model_executor.stage_input_processors.qwen2_5_omni.thinker2talker', 'default_sampling_params': {'temperature': 0.9, 'top_p': 0.8, 'top_k': 40, 'max_tokens': 2048, 'seed': 42, 'detokenize': True, 'repetition_penalty': 1.05, 'stop_token_ids': [8294]}}
INFO 02-05 05:58:00 [omni_stage.py:100] [OmniStage] stage_config: {'stage_id': 2, 'stage_type': 'llm', 'runtime': {'process': True, 'devices': '0', 'max_batch_size': 1}, 'engine_args': {'model_stage': 'code2wav', 'model_arch': 'Qwen2_5OmniForConditionalGeneration', 'worker_type': 'generation', 'scheduler_cls': 'vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler', 'gpu_memory_utilization': 0.15, 'enforce_eager': True, 'trust_remote_code': True, 'enable_prefix_caching': False, 'max_num_batched_tokens': 32768, 'async_scheduling': False, 'engine_output_type': 'audio', 'max_num_seqs': 1, 'async_chunk': False}, 'engine_input_source': [1], 'final_output': True, 'final_output_type': 'audio', 'default_sampling_params': {'temperature': 0.0, 'top_p': 1.0, 'top_k': -1, 'max_tokens': 2048, 'seed': 42, 'detokenize': True, 'repetition_penalty': 1.1}}
INFO 02-05 05:58:00 [omni.py:356] [Orchestrator] Waiting for 3 stages to initialize (timeout: 300s)
[Stage-2] INFO 02-05 05:58:09 [omni_stage.py:497] Starting stage worker with model: /workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00
[Stage-1] INFO 02-05 05:58:09 [omni_stage.py:497] Starting stage worker with model: /workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00
[Stage-1] INFO 02-05 05:58:09 [omni_stage.py:510] [Stage] Set VLLM_WORKER_MULTIPROC_METHOD=spawn
[Stage-2] INFO 02-05 05:58:09 [omni_stage.py:510] [Stage] Set VLLM_WORKER_MULTIPROC_METHOD=spawn
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
[Stage-2] INFO 02-05 05:58:09 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('0', '1')
[Stage-2] INFO 02-05 05:58:09 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('1', '2')
[Stage-2] INFO 02-05 05:58:09 [initialization.py:234] Loaded OmniTransferConfig with 2 connector configurations
[Stage-2] INFO 02-05 05:58:09 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-2] INFO 02-05 05:58:09 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
[Stage-2] INFO 02-05 05:58:09 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-2] INFO 02-05 05:58:09 [initialization.py:60] Created connector for 1 -> 2: SharedMemoryConnector
[Stage-1] INFO 02-05 05:58:09 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('0', '1')
[Stage-1] INFO 02-05 05:58:09 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('1', '2')
[Stage-1] INFO 02-05 05:58:09 [initialization.py:234] Loaded OmniTransferConfig with 2 connector configurations
[Stage-1] INFO 02-05 05:58:09 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-1] INFO 02-05 05:58:09 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
[Stage-1] INFO 02-05 05:58:09 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-1] INFO 02-05 05:58:09 [initialization.py:60] Created connector for 1 -> 2: SharedMemoryConnector
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
[Stage-0] INFO 02-05 05:58:10 [omni_stage.py:497] Starting stage worker with model: /workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00
[Stage-0] INFO 02-05 05:58:10 [omni_stage.py:510] [Stage] Set VLLM_WORKER_MULTIPROC_METHOD=spawn
[Stage-2] INFO 02-05 05:58:18 [model.py:541] Resolved architecture: Qwen2_5OmniModel
[Stage-2] INFO 02-05 05:58:18 [model.py:1561] Using max model len 32768
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
[Stage-1] INFO 02-05 05:58:18 [model.py:541] Resolved architecture: Qwen2_5OmniModel
[Stage-1] INFO 02-05 05:58:18 [model.py:1561] Using max model len 32768
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
[Stage-2] INFO 02-05 05:58:29 [model.py:222] Resolved architecture: Qwen2_5OmniForConditionalGeneration
[Stage-2] INFO 02-05 05:58:29 [model.py:1561] Using max model len 32768
[Stage-2] INFO 02-05 05:58:29 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=32768.
[Stage-2] INFO 02-05 05:58:29 [vllm.py:624] Asynchronous scheduling is disabled.
[Stage-2] WARNING 02-05 05:58:29 [vllm.py:662] Enforce eager set, overriding optimization level to -O0
[Stage-2] INFO 02-05 05:58:29 [vllm.py:762] Cudagraph is disabled under eager mode
[Stage-1] INFO 02-05 05:58:29 [model.py:222] Resolved architecture: Qwen2_5OmniForConditionalGeneration
[Stage-1] INFO 02-05 05:58:29 [model.py:1561] Using max model len 32768
[Stage-1] INFO 02-05 05:58:29 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=32768.
[Stage-1] INFO 02-05 05:58:29 [vllm.py:624] Asynchronous scheduling is enabled.
[Stage-1] WARNING 02-05 05:58:29 [vllm.py:662] Enforce eager set, overriding optimization level to -O0
[Stage-1] INFO 02-05 05:58:29 [vllm.py:762] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:39 [core.py:96] Initializing a V1 LLM engine (v0.15.0) with config: model='/workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00', speculative_config=None, tokenizer='/workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [32768], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 
'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:58:39 [core.py:96] Initializing a V1 LLM engine (v0.15.0) with config: model='/workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00', speculative_config=None, tokenizer='/workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [32768], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 
'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:58:40 [parallel_state.py:1212] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.20.0.2:58453 backend=nccl
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:40 [parallel_state.py:1212] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.20.0.2:54859 backend=nccl
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:40 [parallel_state.py:1423] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:58:40 [parallel_state.py:1423] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(EngineCore_DP0 pid=4615) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(EngineCore_DP0 pid=4681) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:43 [gpu_model_runner.py:4021] Starting to load model /workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00...
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:58:43 [gpu_model_runner.py:4021] Starting to load model /workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00...
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:43 [vllm.py:624] Asynchronous scheduling is disabled.
(EngineCore_DP0 pid=4615) [Stage-2] WARNING 02-05 05:58:43 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:43 [vllm.py:762] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=4615) [Stage-2] WARNING 02-05 05:58:43 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:43 [vllm.py:762] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=4615) [Stage-2] WARNING 02-05 05:58:43 [utils.py:59] Trying to guess the arguments for old-style model class <class 'vllm_omni.model_executor.models.qwen2_5_omni.qwen2_5_omni_token2wav.Qwen2_5OmniToken2WavModel'>
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:58:43 [vllm.py:624] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=4681) [Stage-1] WARNING 02-05 05:58:43 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:58:43 [vllm.py:762] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=4681) [Stage-1] WARNING 02-05 05:58:43 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:58:43 [vllm.py:762] Cudagraph is disabled under eager mode
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:00<00:00, 98.47it/s]
(EngineCore_DP0 pid=4615) 
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:45 [qwen2_5_omni_token2wav.py:1759] [Model Loaded] name=Qwen2_5OmniToken2WavForConditionalGenerationVLLM, success=True, size=1492.80 MB, device=cuda:0
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:45 [default_loader.py:291] Loading weights took 1.34 seconds
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:46 [gpu_model_runner.py:4118] Model loading took 1.46 GiB memory and 2.011382 seconds
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:46 [qwen2_5_omni.py:943] Currently, we do not use the chunked process, we only use the token2wav.process_chunk for the whole sequence. The stream mode will be implemented in the future.
(EngineCore_DP0 pid=4615) [Stage-2] WARNING 02-05 05:58:47 [gpu_generation_model_runner.py:418] Dummy sampler run is not implemented for generation model
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:47 [core.py:272] init engine (profile, create kv cache, warmup model) took 1.35 seconds
(EngineCore_DP0 pid=4615) [Stage-2] WARNING 02-05 05:58:47 [scheduler.py:168] Using custom scheduler class vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore_DP0 pid=4615) [Stage-2] WARNING 02-05 05:58:47 [core.py:129] Disabling chunked prefill for model without KVCache
(EngineCore_DP0 pid=4615) [Stage-2] WARNING 02-05 05:58:47 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:47 [vllm.py:762] Cudagraph is disabled under eager mode
[Stage-2] INFO 02-05 05:58:48 [omni_llm.py:172] Supported_tasks: ['generate']
[Stage-2] INFO 02-05 05:58:48 [initialization.py:288] [Stage-2] Initializing OmniConnectors with config keys: ['from_stage_1']
[Stage-2] INFO 02-05 05:58:48 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-2] INFO 02-05 05:58:48 [initialization.py:60] Created connector for 1 -> 2: SharedMemoryConnector
[Stage-2] INFO 02-05 05:58:48 [omni_stage.py:740] Max batch size: 1
INFO 02-05 05:58:48 [omni.py:349] [Orchestrator] Stage-2 reported ready
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
[Stage-0] INFO 02-05 05:58:48 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('0', '1')
[Stage-0] INFO 02-05 05:58:48 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('1', '2')
[Stage-0] INFO 02-05 05:58:48 [initialization.py:234] Loaded OmniTransferConfig with 2 connector configurations
[Stage-0] INFO 02-05 05:58:48 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-0] INFO 02-05 05:58:48 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
[Stage-0] INFO 02-05 05:58:48 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-0] INFO 02-05 05:58:48 [initialization.py:60] Created connector for 1 -> 2: SharedMemoryConnector
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
[Stage-0] INFO 02-05 05:58:48 [model.py:541] Resolved architecture: Qwen2_5OmniModel
[Stage-0] INFO 02-05 05:58:48 [model.py:1561] Using max model len 32768
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
[Stage-0] INFO 02-05 05:58:58 [model.py:222] Resolved architecture: Qwen2_5OmniForConditionalGeneration
[Stage-0] INFO 02-05 05:58:58 [model.py:1561] Using max model len 32768
[Stage-0] INFO 02-05 05:58:58 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=32768.
[Stage-0] INFO 02-05 05:58:58 [vllm.py:624] Asynchronous scheduling is enabled.
[Stage-0] WARNING 02-05 05:58:58 [vllm.py:662] Enforce eager set, overriding optimization level to -O0
[Stage-0] INFO 02-05 05:58:58 [vllm.py:762] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:59:04 [cuda.py:364] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:59:04 [mm_encoder_attention.py:77] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:00<00:00, 96.70it/s]
(EngineCore_DP0 pid=4681) 
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:08 [core.py:96] Initializing a V1 LLM engine (v0.15.0) with config: model='/workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00', speculative_config=None, tokenizer='/workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [32768], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 
'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:09 [parallel_state.py:1212] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.20.0.2:33551 backend=nccl
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:09 [parallel_state.py:1423] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(EngineCore_DP0 pid=5809) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:12 [gpu_model_runner.py:4021] Starting to load model /workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00...
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:13 [vllm.py:624] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=5809) [Stage-0] WARNING 02-05 05:59:13 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:13 [vllm.py:762] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=5809) [Stage-0] WARNING 02-05 05:59:13 [qwen2_5_omni_thinker.py:272] flash_attn is not available, the model may not yield the exactly same result as the transformers implementation in the audio tower part.
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:13 [mm_encoder_attention.py:77] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore_DP0 pid=5809) [Stage-0] WARNING 02-05 05:59:13 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:13 [vllm.py:762] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:13 [cuda.py:364] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:00<00:00, 98.96it/s]
(EngineCore_DP0 pid=5809) 
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:17 [default_loader.py:291] Loading weights took 3.89 seconds
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:59:18 [qwen2_5_omni_talker.py:196] [Model Loaded] name=Qwen2_5OmniTalkerForConditionalGeneration, success=True, size=5087.96 MB, device=cuda:0
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:18 [gpu_model_runner.py:4118] Model loading took 16.74 GiB memory and 4.539376 seconds
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:59:18 [default_loader.py:291] Loading weights took 14.11 seconds
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:18 [gpu_model_runner.py:4946] Encoder cache will be initialized with a budget of 32768 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:59:19 [gpu_model_runner.py:4118] Model loading took 6.03 GiB memory and 34.718209 seconds
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:59:19 [gpu_model_runner.py:4946] Encoder cache will be initialized with a budget of 32768 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:59:23 [gpu_worker.py:356] Available KV cache memory: 28.23 GiB
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:59:23 [kv_cache_utils.py:1307] GPU KV cache size: 616,784 tokens
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:59:23 [kv_cache_utils.py:1312] Maximum concurrency for 32,768 tokens per request: 18.82x
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:59:23 [core.py:272] init engine (profile, create kv cache, warmup model) took 4.18 seconds
(EngineCore_DP0 pid=4681) [Stage-1] WARNING 02-05 05:59:23 [scheduler.py:168] Using custom scheduler class vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore_DP0 pid=4681) [Stage-1] WARNING 02-05 05:59:23 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:59:23 [vllm.py:762] Cudagraph is disabled under eager mode
[Stage-1] INFO 02-05 05:59:24 [omni_llm.py:172] Supported_tasks: ['generate']
[Stage-1] INFO 02-05 05:59:24 [initialization.py:288] [Stage-1] Initializing OmniConnectors with config keys: ['from_stage_0']
[Stage-1] INFO 02-05 05:59:24 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-1] INFO 02-05 05:59:24 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
[Stage-1] INFO 02-05 05:59:24 [omni_stage.py:740] Max batch size: 1
INFO 02-05 05:59:24 [omni.py:349] [Orchestrator] Stage-1 reported ready
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:25 [gpu_worker.py:356] Available KV cache memory: 17.09 GiB
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:25 [kv_cache_utils.py:1307] GPU KV cache size: 320,000 tokens
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:25 [kv_cache_utils.py:1312] Maximum concurrency for 32,768 tokens per request: 9.77x
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:25 [core.py:272] init engine (profile, create kv cache, warmup model) took 7.36 seconds
(EngineCore_DP0 pid=5809) [Stage-0] WARNING 02-05 05:59:25 [scheduler.py:168] Using custom scheduler class vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore_DP0 pid=5809) [Stage-0] WARNING 02-05 05:59:26 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:26 [vllm.py:762] Cudagraph is disabled under eager mode
[Stage-0] INFO 02-05 05:59:26 [omni_llm.py:172] Supported_tasks: ['generate']
[Stage-0] INFO 02-05 05:59:26 [initialization.py:288] [Stage-0] Initializing OmniConnectors with config keys: ['to_stage_1']
[Stage-0] INFO 02-05 05:59:26 [omni_stage.py:740] Max batch size: 1
INFO 02-05 05:59:26 [omni.py:349] [Orchestrator] Stage-0 reported ready
INFO 02-05 05:59:26 [omni.py:375] [Orchestrator] All stages initialized successfully

zzhuoxin1508 and others added 2 commits February 5, 2026 14:26
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>
Collaborator

@lishunyang12 lishunyang12 left a comment

Tested with Qwen-Image-Edit, and initialization fails:

(workspace) root@925981d52983:/workspace/vllm-omni/examples/offline_inference/image_to_image# python image_edit.py \
  --image qwen-bear.png \
  --prompt "Let this mascot dance under the moon, surrounded by floating stars and poetic bubbles such as 'Be Kind'" \
  --output output_image_edit.png \
  --num_inference_steps 50 \
  --cfg_scale 4.0
INFO 02-06 15:54:04 [weight_utils.py:50] Using model weights format ['**/*.json', '**/*.bin', '**/*.safetensors', '**/*.pt', '**/*.txt', '**/*.model', '**/*.yaml']
added_tokens.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 605/605 [00:00<00:00, 1.68MB/s]
preprocessor_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 788/788 [00:00<00:00, 2.61MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 613/613 [00:00<00:00, 2.54MB/s]
processor/tokenizer.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:01<00:00, 11.0MB/s]
tokenizer_config.json: 4.73kB [00:00, 16.3MB/s]
video_preprocessor_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 904/904 [00:00<00:00, 7.05MB/s]
vocab.json: 2.78MB [00:00, 18.2MB/s]
scheduler_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 485/485 [00:00<00:00, 2.00MB/s]
config.json: 3.22kB [00:00, 7.97MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 244/244 [00:00<00:00, 1.02MB/s]
model.safetensors.index.json: 57.7kB [00:00, 102MB/s]
tokenizer_config.json: 4.69kB [00:00, 12.3MB/s]
vocab.json: 3.38MB [00:00, 53.7MB/s]
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 339/339 [00:00<00:00, 1.37MB/s]
(…)ion_pytorch_model.safetensors.index.json: 199kB [00:00, 110MB/s]
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 730/730 [00:00<00:00, 3.07MB/s]
text_encoder/model-00001-of-00004.safete(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 4.97G/4.97G [00:12<00:00, 398MB/s]
text_encoder/model-00002-of-00004.safete(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 4.99G/4.99G [00:12<00:00, 408MB/s]
text_encoder/model-00003-of-00004.safete(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:12<00:00, 399MB/s]
text_encoder/model-00004-of-00004.safete(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 1.69G/1.69G [00:04<00:00, 353MB/s]
transformer/diffusion_pytorch_model-0000(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 4.99G/4.99G [00:11<00:00, 422MB/s]
transformer/diffusion_pytorch_model-0000(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 4.98G/4.98G [00:11<00:00, 421MB/s]
transformer/diffusion_pytorch_model-0000(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 4.95G/4.95G [00:33<00:00, 149MB/s]
transformer/diffusion_pytorch_model-0000(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 4.98G/4.98G [00:12<00:00, 411MB/s]
transformer/diffusion_pytorch_model-0000(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 4.95G/4.95G [00:11<00:00, 422MB/s]
transformer/diffusion_pytorch_model-0000(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 4.95G/4.95G [00:11<00:00, 427MB/s]
transformer/diffusion_pytorch_model-0000(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 4.91G/4.91G [00:12<00:00, 394MB/s]
transformer/diffusion_pytorch_model-0000(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 4.98G/4.98G [00:12<00:00, 398MB/s]
transformer/diffusion_pytorch_model-0000(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 1.17G/1.17G [00:03<00:00, 317MB/s]
vae/diffusion_pytorch_model.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 254M/254M [00:01<00:00, 156MB/s]
merges.txt: 1.67MB [00:00, 41.0MB/s]
INFO 02-06 15:56:58 [weight_utils.py:71] Time spent downloading weights for Qwen/Qwen-Image-Edit: 174.753237 seconds
INFO 02-06 15:56:58 [omni.py:132] Initializing stages for model: /workspace/.cache/huggingface/hub/models--Qwen--Qwen-Image-Edit/snapshots/ac7f9318f633fc4b5778c59367c8128225f1e3de
Traceback (most recent call last):
  File "/workspace/.venv/lib/python3.12/site-packages/vllm/transformers_utils/config.py", line 604, in get_config
    raise ValueError(
ValueError: Could not detect config format for no config file found. With config_format 'auto', ensure your model has either config.json (HF format) or params.json (Mistral format). Otherwise please specify your_custom_config_format in engine args for customized config parser.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/vllm-omni/vllm_omni/entrypoints/utils.py", line 139, in resolve_model_config_path
    hf_config = get_config(model, trust_remote_code=True)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.12/site-packages/vllm/transformers_utils/config.py", line 625, in get_config
    raise ValueError(error_message) from e
ValueError: Invalid repository ID or local directory specified: '/workspace/.cache/huggingface/hub/models--Qwen--Qwen-Image-Edit/snapshots/ac7f9318f633fc4b5778c59367c8128225f1e3de'.
Please verify the following requirements:
1. Provide a valid Hugging Face repository ID.
2. Specify a local directory that contains a recognized configuration file.
   - For Hugging Face models: ensure the presence of a 'config.json'.
   - For Mistral models: ensure the presence of a 'params.json'.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/vllm-omni/examples/offline_inference/image_to_image/image_edit.py", line 492, in <module>
    main()
  File "/workspace/vllm-omni/examples/offline_inference/image_to_image/image_edit.py", line 362, in main
    omni = Omni(
           ^^^^^
  File "/workspace/vllm-omni/vllm_omni/entrypoints/omni.py", line 535, in __init__
    super().__init__(model, **kwargs)
  File "/workspace/vllm-omni/vllm_omni/entrypoints/omni.py", line 133, in __init__
    self._initialize_stages(model, kwargs)
  File "/workspace/vllm-omni/vllm_omni/entrypoints/omni.py", line 221, in _initialize_stages
    self.config_path = resolve_model_config_path(model)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm-omni/vllm_omni/entrypoints/utils.py", line 162, in resolve_model_config_path
    raise ValueError(
ValueError: Could not determine model_type for model: /workspace/.cache/huggingface/hub/models--Qwen--Qwen-Image-Edit/snapshots/ac7f9318f633fc4b5778c59367c8128225f1e3de. Model is not in standard transformers format and does not have model_index.json. Please ensure the model has proper configuration files with 'model_type' field
(workspace) root@925981d52983:/workspace/vllm-omni/examples/offline_inference/image_to_image# 
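
The traceback above reports that no `config.json`, `params.json`, or `model_index.json` was found in the snapshot directory. A minimal sketch of a pre-flight check for that condition follows. The helper names `find_root_config` and `demo` are hypothetical and are not part of vllm-omni or vLLM; the file names are taken from the error messages.

```python
import tempfile
from pathlib import Path

# Root-level config file names mentioned in the error messages above.
ROOT_CONFIG_CANDIDATES = ("config.json", "params.json", "model_index.json")


def find_root_config(snapshot_dir):
    """Return the first root-level config file name found, or None.

    Hypothetical helper: only the top level of the directory is checked,
    because a config.json inside a subfolder does not satisfy the loader.
    """
    root = Path(snapshot_dir)
    for name in ROOT_CONFIG_CANDIDATES:
        if (root / name).is_file():
            return name
    return None


def demo():
    """Build a throwaway snapshot layout and check it before and after
    a root-level model_index.json is added."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Only a nested config exists at first, as in the failing run.
        (root / "text_encoder").mkdir()
        (root / "text_encoder" / "config.json").write_text("{}")
        before = find_root_config(root)
        # Add the root-level file that the error message asks for.
        (root / "model_index.json").write_text("{}")
        after = find_root_config(root)
    return before, after


if __name__ == "__main__":
    print(demo())
```

Running a check like this right after the download step would surface a missing root-level file before any stage worker is started.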


@lishunyang12
Collaborator

lishunyang12 commented Feb 6, 2026

Can you take a look at how diffusers and vLLM handle this situation? Trace the respective code and try running their examples.

@zzhuoxin1508
Contributor Author

> Can you take a look at how diffusers and vLLM handle this situation? Trace the respective code and try running their examples.

ok

Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>
@tzhouam tzhouam self-requested a review February 9, 2026 07:21
Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>
@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label Feb 9, 2026
@zzhuoxin1508
Contributor Author

I have updated the code: the weight downloader logic now includes root-level files. Verified successful initialization and output for Qwen-Image-Edit, Qwen2.5-Omni, and Wan2.2-5B-Diffusers. @hsliuustc0106 @lishunyang12
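
The root-level issue can be illustrated with a small sketch. It assumes the downloader filters repository files with `fnmatch`-style patterns (an assumption here, not confirmed by this thread). Under that matching, `**/*.json` requires a `/` in the path, so a file at the repository root is skipped. The helper `matches_any` and the pattern lists are hypothetical; the file names come from the logs and error message above.

```python
from fnmatch import fnmatchcase


def matches_any(path: str, patterns: list[str]) -> bool:
    """Return True if the path matches at least one fnmatch-style pattern."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


# File names that appear in the logs and error message above.
repo_files = [
    "model_index.json",
    "processor/tokenizer.json",
    "vae/diffusion_pytorch_model.safetensors",
]

# Recursive-only patterns, as in the failing run.
recursive_only = ["**/*.json", "**/*.safetensors"]

# Hypothetical fix: keep the plain patterns alongside the recursive ones.
with_root = ["*.json", "*.safetensors"] + recursive_only

# "**/" needs a literal "/" in the path, so the root-level file is missed.
missed = [f for f in repo_files if not matches_any(f, recursive_only)]
print(missed)  # ['model_index.json']

# With the plain patterns included, every file is selected.
print(all(matches_any(f, with_root) for f in repo_files))  # True
```

Note that with `fnmatch`, `*` also matches `/`, so the plain `*.json` pattern already covers files in subdirectories.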

@lishunyang12
Collaborator

> I have updated the code: the weight downloader logic now includes root-level files. Verified successful initialization and output for Qwen-Image-Edit, Qwen2.5-Omni, and Wan2.2-5B-Diffusers. @hsliuustc0106 @lishunyang12

LGTM. Could you help check on BAGEL? @princepride

Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>
Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>
Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>
@zzhuoxin1508 zzhuoxin1508 marked this pull request as draft February 9, 2026 10:40
Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>
Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>
@zzhuoxin1508
Contributor Author

BAGEL test: verified successful initialization and output.
@lishunyang12 @princepride
(workspace) root@240abbafaded:/workspace/vllm-omni/examples/offline_inference/bagel# python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
  --modality text2img \
  --prompts "A cute cat"
INFO 02-09 12:04:06 [weight_utils.py:50] Using model weights format ['*']
.gitattributes: 1.52kB [00:00, 6.85MB/s]
README.md: 6.79kB [00:00, 19.9MB/s]
ae.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████| 335M/335M [00:03<00:00, 95.6MB/s]
config.json: 1.44kB [00:00, 5.89MB/s]
ema.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████| 29.2G/29.2G [03:57<00:00, 123MB/s]
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████| 243/243 [00:00<00:00, 1.76MB/s]
llm_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████| 663/663 [00:00<00:00, 4.06MB/s]
merges.txt: 1.67MB [00:00, 28.8MB/s]
model.safetensors.index.json: 123kB [00:00, 210MB/s]
preprocessor_config.json: 100%|████████████████████████████████████████████████████████████████████████████| 392/392 [00:00<00:00, 2.54MB/s]
tokenizer.json: 7.03MB [00:00, 130MB/s]
tokenizer_config.json: 7.30kB [00:00, 25.4MB/s]
vit_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████| 205/205 [00:00<00:00, 1.57MB/s]
vocab.json: 2.78MB [00:00, 80.0MB/s]
INFO 02-09 12:08:08 [weight_utils.py:82] Time spent downloading weights for ByteDance-Seed/BAGEL-7B-MoT: 241.947614 seconds
INFO 02-09 12:08:08 [omni.py:135] Initializing stages for model: /workspace/.cache/huggingface/hub/models--ByteDance-Seed--BAGEL-7B-MoT/snapshots/5019f57d168e5816e8f3f701b17cc816bb7cf24b
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
INFO 02-09 12:08:08 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('0', '1')
INFO 02-09 12:08:08 [initialization.py:234] Loaded OmniTransferConfig with 1 connector configurations
INFO 02-09 12:08:08 [factory.py:46] Created connector: SharedMemoryConnector
INFO 02-09 12:08:08 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
INFO 02-09 12:08:08 [omni_stage.py:239] [OmniStage] stage_config: {'stage_id': 0, 'stage_type': 'llm', 'runtime': {'devices': '0', 'max_batch_size': 1}, 'engine_args': {'model_stage': 'thinker', 'model_arch': 'BagelForConditionalGeneration', 'worker_type': 'ar', 'scheduler_cls': 'vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler', 'gpu_memory_utilization': 0.35, 'enforce_eager': True, 'trust_remote_code': True, 'engine_output_type': 'text', 'distributed_executor_backend': 'mp', 'enable_prefix_caching': False, 'max_num_batched_tokens': 32768, 'tensor_parallel_size': 1, 'omni_kv_config': {'need_send_cache': True, 'kv_transfer_criteria': {'type': 'prefill_finished'}}, 'max_num_seqs': 1, 'async_chunk': False}, 'final_output': True, 'final_output_type': 'text', 'is_comprehension': True, 'default_sampling_params': {'temperature': 0.4, 'top_p': 0.9, 'top_k': 1, 'max_tokens': 2048, 'seed': 52, 'detokenize': True, 'repetition_penalty': 1.05}}
INFO 02-09 12:08:08 [omni_stage.py:239] [OmniStage] stage_config: {'stage_id': 1, 'stage_type': 'diffusion', 'runtime': {'devices': '1', 'max_batch_size': 1}, 'engine_args': {'model_stage': 'dit', 'gpu_memory_utilization': 0.55, 'enforce_eager': True, 'trust_remote_code': True, 'engine_output_type': 'image', 'distributed_executor_backend': 'mp', 'enable_prefix_caching': False, 'max_num_batched_tokens': 32768, 'tensor_parallel_size': 1, 'omni_kv_config': {'need_recv_cache': True}}, 'engine_input_source': [0], 'final_output': True, 'final_output_type': 'image', 'is_comprehension': False, 'default_sampling_params': {'seed': 52}}
INFO 02-09 12:08:08 [omni.py:354] [Orchestrator] Waiting for 2 stages to initialize (timeout: 300s)
[Stage-1] INFO 02-09 12:08:17 [omni_stage.py:636] Starting stage worker with model: /workspace/.cache/huggingface/hub/models--ByteDance-Seed--BAGEL-7B-MoT/snapshots/5019f57d168e5816e8f3f701b17cc816bb7cf24b
[Stage-1] INFO 02-09 12:08:17 [omni_stage.py:649] [Stage] Set VLLM_WORKER_MULTIPROC_METHOD=spawn
[Stage-0] INFO 02-09 12:08:17 [omni_stage.py:636] Starting stage worker with model: /workspace/.cache/huggingface/hub/models--ByteDance-Seed--BAGEL-7B-MoT/snapshots/5019f57d168e5816e8f3f701b17cc816bb7cf24b
[Stage-0] INFO 02-09 12:08:17 [omni_stage.py:649] [Stage] Set VLLM_WORKER_MULTIPROC_METHOD=spawn
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
[Stage-0] INFO 02-09 12:08:17 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('0', '1')
[Stage-0] INFO 02-09 12:08:17 [initialization.py:234] Loaded OmniTransferConfig with 1 connector configurations
[Stage-0] INFO 02-09 12:08:17 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-0] INFO 02-09 12:08:17 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
[Stage-1] INFO 02-09 12:08:18 [multiproc_executor.py:74] Starting server...
[Stage-0] INFO 02-09 12:08:25 [model.py:541] Resolved architecture: BagelForConditionalGeneration
[Stage-0] INFO 02-09 12:08:25 [model.py:1561] Using max model len 32768
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
[Stage-0] INFO 02-09 12:08:25 [model.py:222] Resolved architecture: BagelForConditionalGeneration
[Stage-0] INFO 02-09 12:08:25 [model.py:1561] Using max model len 32768
[Stage-0] INFO 02-09 12:08:25 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=32768.
[Stage-0] INFO 02-09 12:08:25 [vllm.py:624] Asynchronous scheduling is enabled.
[Stage-0] WARNING 02-09 12:08:25 [vllm.py:662] Enforce eager set, overriding optimization level to -O0
[Stage-0] INFO 02-09 12:08:25 [vllm.py:762] Cudagraph is disabled under eager mode
/workspace/.venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
warnings.warn(
[Stage-1] INFO 02-09 12:08:28 [diffusion_worker.py:269] Worker 0 created result MessageQueue
[Stage-1] INFO 02-09 12:08:28 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=2048.
[Stage-1] INFO 02-09 12:08:28 [vllm.py:624] Asynchronous scheduling is enabled.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Stage-1] INFO 02-09 12:08:28 [diffusion_worker.py:95] Worker 0: Initialized device and distributed environment.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Stage-1] INFO 02-09 12:08:28 [parallel_state.py:565] Building SP subgroups from explicit sp_group_ranks (sp_size=1, ulysses=1, ring=1, use_ulysses_low=True).
[Stage-1] INFO 02-09 12:08:28 [parallel_state.py:607] SP group details for rank 0: sp_group=[0], ulysses_group=[0], ring_group=[0]
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00, 1.31s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00, 1.31s/it]

[Stage-1] INFO 02-09 12:08:32 [pipeline_bagel.py:576] BagelPipeline weight filter kept 1466/1467 tensors (shape mismatches seen: 0)
[Stage-1] INFO 02-09 12:08:32 [diffusers_loader.py:227] Loading weights took 2.70 seconds
[Stage-1] INFO 02-09 12:08:33 [diffusion_model_runner.py:103] Model loading took 26.5048 GiB and 4.745859 seconds
[Stage-1] INFO 02-09 12:08:33 [diffusion_model_runner.py:108] Model runner: Model loaded successfully.
[Stage-1] INFO 02-09 12:08:33 [diffusion_model_runner.py:137] Model runner: Initialization complete.
[Stage-1] INFO 02-09 12:08:33 [manager.py:90] Initializing DiffusionLoRAManager: device=cuda:0, dtype=torch.bfloat16, max_cached_adapters=1, static_lora_path=None
[Stage-1] INFO 02-09 12:08:33 [diffusion_worker.py:126] Worker 0: Initialization complete.
[Stage-1] INFO 02-09 12:08:33 [diffusion_worker.py:393] Worker 0: Scheduler loop started.
[Stage-1] INFO 02-09 12:08:33 [diffusion_worker.py:320] Worker 0 ready to receive requests via shared memory
[Stage-1] INFO 02-09 12:08:33 [scheduler.py:38] SyncScheduler initialized result MessageQueue
[Stage-1] INFO 02-09 12:08:33 [diffusion_engine.py:337] dummy run to warm up the model
[Stage-1] INFO 02-09 12:08:33 [manager.py:538] Deactivating all adapters: 0 layers
[Stage-1] WARNING 02-09 12:08:33 [kv_transfer_manager.py:452] Request has no ID, cannot receive KV cache
[Stage-1] INFO 02-09 12:08:35 [initialization.py:288] [Stage-1] Initializing OmniConnectors with config keys: ['from_stage_0']
[Stage-1] INFO 02-09 12:08:35 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-1] INFO 02-09 12:08:35 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
[Stage-1] INFO 02-09 12:08:35 [omni_stage.py:728] Max batch size: 1
INFO 02-09 12:08:35 [omni.py:347] [Orchestrator] Stage-1 reported ready
(EngineCore_DP0 pid=2948) [Stage-0] INFO 02-09 12:08:35 [core.py:96] Initializing a V1 LLM engine (v0.15.0) with config: model='/workspace/.cache/huggingface/hub/models--ByteDance-Seed--BAGEL-7B-MoT/snapshots/5019f57d168e5816e8f3f701b17cc816bb7cf24b', speculative_config=None, tokenizer='/workspace/.cache/huggingface/hub/models--ByteDance-Seed--BAGEL-7B-MoT/snapshots/5019f57d168e5816e8f3f701b17cc816bb7cf24b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/workspace/.cache/huggingface/hub/models--ByteDance-Seed--BAGEL-7B-MoT/snapshots/5019f57d168e5816e8f3f701b17cc816bb7cf24b, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [32768], 'inductor_compile_config': 
{'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}
(EngineCore_DP0 pid=2948) [Stage-0] WARNING 02-09 12:08:35 [multiproc_executor.py:910] Reducing Torch parallelism from 12 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
/workspace/.venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
warnings.warn(
[Stage-0] INFO 02-09 12:08:45 [parallel_state.py:1212] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:47331 backend=nccl
[Stage-0] INFO 02-09 12:08:45 [parallel_state.py:1423] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(Worker pid=3169) [Stage-0] INFO 02-09 12:08:46 [gpu_model_runner.py:4021] Starting to load model /workspace/.cache/huggingface/hub/models--ByteDance-Seed--BAGEL-7B-MoT/snapshots/5019f57d168e5816e8f3f701b17cc816bb7cf24b...
(Worker pid=3169) [Stage-0] INFO 02-09 12:08:46 [vllm.py:624] Asynchronous scheduling is enabled.
(Worker pid=3169) [Stage-0] WARNING 02-09 12:08:46 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(Worker pid=3169) [Stage-0] INFO 02-09 12:08:46 [vllm.py:762] Cudagraph is disabled under eager mode
(Worker pid=3169) [Stage-0] INFO 02-09 12:09:07 [cuda.py:364] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
(Worker pid=3169) [Stage-0] WARNING 02-09 12:09:07 [bagel.py:391] Overriding vit_config.num_hidden_layers from 27 to 26 to match the Bagel model checkpoint.
(Worker pid=3169) [Stage-0] WARNING 02-09 12:09:07 [bagel.py:397] Setting vit_config.vision_use_head to False as it is not present in the Bagel model checkpoint.
(Worker pid=3169) [Stage-0] INFO 02-09 12:09:07 [mm_encoder_attention.py:77] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 57.63it/s]
(Worker pid=3169)
(Worker pid=3169) [Stage-0] INFO 02-09 12:09:08 [default_loader.py:291] Loading weights took 1.43 seconds
(Worker pid=3169) [Stage-0] INFO 02-09 12:09:09 [gpu_model_runner.py:4118] Model loading took 15.04 GiB memory and 22.247672 seconds
(Worker pid=3169) [Stage-0] INFO 02-09 12:09:09 [gpu_model_runner.py:4946] Encoder cache will be initialized with a budget of 32768 tokens, and profiled with 6 image items of the maximum feature size.
(Worker pid=3169) Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
(Worker pid=3169) /workspace/.venv/lib/python3.12/site-packages/transformers/image_processing_utils.py:51: UserWarning: The following named arguments are not valid for SiglipImageProcessor.preprocess and were ignored: 'truncation'
(Worker pid=3169) return self.preprocess(images, **kwargs)
(Worker pid=3169) [Stage-0] INFO 02-09 12:09:13 [base.py:75] Available KV cache memory: 11.02 GiB (process-scoped)
(EngineCore_DP0 pid=2948) [Stage-0] INFO 02-09 12:09:13 [kv_cache_utils.py:1307] GPU KV cache size: 206,272 tokens
(EngineCore_DP0 pid=2948) [Stage-0] INFO 02-09 12:09:13 [kv_cache_utils.py:1312] Maximum concurrency for 32,768 tokens per request: 6.29x
(EngineCore_DP0 pid=2948) [Stage-0] INFO 02-09 12:09:13 [core.py:272] init engine (profile, create kv cache, warmup model) took 4.47 seconds
(EngineCore_DP0 pid=2948) [Stage-0] WARNING 02-09 12:09:14 [scheduler.py:168] Using custom scheduler class vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore_DP0 pid=2948) /workspace/.venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
(EngineCore_DP0 pid=2948) warnings.warn(
(EngineCore_DP0 pid=2948) [Stage-0] INFO 02-09 12:09:14 [vllm.py:624] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=2948) [Stage-0] WARNING 02-09 12:09:14 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=2948) [Stage-0] INFO 02-09 12:09:14 [vllm.py:762] Cudagraph is disabled under eager mode
[Stage-0] INFO 02-09 12:09:14 [omni_llm.py:172] Supported_tasks: ['generate']
[Stage-0] INFO 02-09 12:09:14 [initialization.py:288] [Stage-0] Initializing OmniConnectors with config keys: ['to_stage_1']
[Stage-0] INFO 02-09 12:09:14 [omni_stage.py:728] Max batch size: 1
INFO 02-09 12:09:14 [omni.py:347] [Orchestrator] Stage-0 reported ready
INFO 02-09 12:09:14 [omni.py:373] [Orchestrator] All stages initialized successfully

@zzhuoxin1508 zzhuoxin1508 marked this pull request as ready for review February 9, 2026 12:14

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9b3c66c34d

Comment on lines +86 to +90
return download_weights_from_hf_specific(
    model_name_or_path=model_id,
    cache_dir=None,
    allow_patterns=["*"],
    require_all=True,

P2 Badge Avoid forcing full repo download in offline mode

When HF_HUB_OFFLINE is set, download_weights_from_hf_specific passes local_files_only=True, so Hugging Face will only use cached files and will error if requested files are missing. With allow_patterns=["*"] here, the orchestrator now requests the entire repo before spawning stages, which means a partially cached model that previously loaded from the subset it needed will now fail during startup in offline environments. This is a regression for offline workflows where only weights are cached. Consider skipping the prefetch when offline or falling back to the narrower weight patterns in that case.
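
One way to read this suggestion, as a minimal sketch rather than the change that was merged: choose the prefetch patterns based on whether the Hub is in offline mode. The helper names `is_offline` and `choose_prefetch_patterns` are hypothetical, the narrow pattern list is copied from the `Using model weights format` log line in this PR, and the set of truthy values accepted for `HF_HUB_OFFLINE` is an assumption about how that variable is parsed.

```python
import os

# Narrow weight patterns, copied from the "Using model weights format" log line.
WEIGHT_PATTERNS = ["*.json", "*.bin", "*.safetensors", "*.pt", "*.txt", "*.model", "*.yaml"]

# Assumption: values treated as "true" for HF_HUB_OFFLINE.
_TRUE_VALUES = {"1", "ON", "YES", "TRUE"}


def is_offline(env=None):
    """Return True when HF_HUB_OFFLINE is set to a truthy value (hypothetical helper)."""
    env = os.environ if env is None else env
    return env.get("HF_HUB_OFFLINE", "").strip().upper() in _TRUE_VALUES


def choose_prefetch_patterns(env=None):
    """Return (allow_patterns, require_all) for the orchestrator prefetch.

    Offline: fall back to the narrow weight patterns and do not insist that
    every pattern matches, so a partially cached model can still start.
    Online: request the whole repo and require every pattern to resolve.
    """
    if is_offline(env):
        return list(WEIGHT_PATTERNS), False
    return ["*"], True
```

The returned pair would then be passed as `allow_patterns` and `require_all` to `download_weights_from_hf_specific`. Whether `require_all=False` is the right offline fallback, or whether the prefetch should be skipped entirely, is a judgment call for the maintainers.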

@Gaohan123 Gaohan123 added this to the v0.16.0 milestone Feb 10, 2026
Collaborator

@Gaohan123 Gaohan123 left a comment

LGTM. Thanks!

@Gaohan123 Gaohan123 merged commit 109bb97 into vllm-project:main Feb 10, 2026
7 checks passed
YanickSchraner pushed a commit to YanickSchraner/vllm-omni that referenced this pull request Feb 20, 2026
…lm-project#1213)

Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@zzhuoxin1508 zzhuoxin1508 deleted the fix/load-before-init branch March 1, 2026 09:51
with1015 added a commit to with1015/vllm-omni that referenced this pull request Apr 6, 2026
* [Frontend][Model] Support batch request with refined OmniDiffusionReq… (#797)

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

* [Model]: add FLUX.1-dev model (#853)

* [BugFix] ignore mm data from stages to async omni (#954)

Signed-off-by: dengyunyang <584797741@qq.com>

* Revert "[BugFix] ignore mm data from stages to async omni" (#1023)

* [Bugfix] Modify output to model_runner_output (#1026)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [Feature] Support cache-dit for Wan 2.2 inference (#1021)

Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: Samit <285365963@qq.com>

* [Doc]Format profiling doc (#993)

Signed-off-by: lishunyang <lishunyang12@163.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [Hardware] Support platforms and plugin system (#774)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [Core]: KV Cache Transfer Encapsulation (#979)

Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>

* [Test]Delete skip mark for amd ci test and fix CI failure (#927)

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Bugfix][Doc]Specify Qwen3-TTS model name for each task type (#1036)

Signed-off-by: Kyle Huang <yellowsea@gmail.com>

* [Misc] pin version of fa3-fwd (#1051)

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

* [CI] [ROCm] Add more AMD CI tests (#1039)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

* [Bugfix] fix qwen image layerd in dummy run (#1027)

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

* [BugFix] Fix noisy output without setting a seed in Qwen Image (#1043)

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

* [bugfix] remove vllm speech route (#1060)

Signed-off-by: linyueqian <linyueqian@outlook.com>

* [Debug] Update GLM-Image Pipeline (#1049)

Co-authored-by: root <root@hk01dgx028.cm.cluster>

* [Diffusion][Bugfix] Fix the flash_attn backends selection logic (#983)

Signed-off-by: mxuax <mxuax@connect.ust.hk>
Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [BugFix] Fix the accuracy issue of multimodal input. (#1020)

Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
Co-authored-by: Rein Yang <ruiruyang2@gmail.com>

* [Bugfix] Set VaeImageProcessor `do_convert_rgb` True (#1032)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [feat]: adapt batch request for flux (#1028)

Signed-off-by: wuzhongjian wuzhongjian_yewu@cmss.chinamobile.com

* [CI] Change Qwen3 Omni stage placement strategy  (#1072)

Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>

* [BugFix] Fix to use correct attn backend (#1038)

Signed-off-by: Divyansh Singhvi <divyanshsinghvi@gmail.com>

* [Perf] Qwen3 Omni talker mtp optimization (#1005)

Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Wan2.2] Optimize memory usage with conditional transformer loading (#980)

Signed-off-by: Lin, Fanli <fanli.lin@intel.com>
Signed-off-by: Samit <285365963@qq.com>
Co-authored-by: Samit <285365963@qq.com>

* [Feat] Support XPU Backend in vLLM-Omni (#191)

Signed-off-by: Fanli Lin <fanli.lin@intel.com>
Signed-off-by: Fanli Lin <fanli0116@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [Fix] stabilize diffusion images LoRA E2E across CI drift (#1075)

Signed-off-by: dongbo910220 <1275604947@qq.com>

* [Bugfix][Test] Re-enable the log simple tests (#1065)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [Bugfix] pr conflict fix, bugfix ignore mm data from stages to async omni (#1025)

Signed-off-by: dengyunyang <584797741@qq.com>

* [Doc][Bagel] Add BAGEL-7B-MoT documentation and edit the default stage configuration (#987)

Signed-off-by: Ding Zuhao <e1583181@u.nus.edu>
Signed-off-by: jzz <e1583181@u.nus.edu>

* [Fix] Increase max wait time for server readiness to accommodate model loading (#1089)

Signed-off-by: Andy Zhou <46011930+AndyZhou952@users.noreply.github.com>

* [Benchmark] Add vLLM-Omni Omni model online benchmark (#780)

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Bugfix] Remove Mooncake/Yuanrong connector import warning (#1091)

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

* fix: UnboundLocalError for role in streaming audio/image responses (#784)

Signed-off-by: Pierre Le Guen <26087574+PierreLeGuen@users.noreply.github.com>

* [Misc] update wechat image (#1096)

* [Feature] Support DiT Layerwise (Blockwise) CPU Offloading (#858)

Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [BugFix] Modify max_tokens and modify the log and fix #1103 (#1097)

Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [BugFix] Fix modulate_index shape error in Qwen-Image-Edit Task (#1100)

Signed-off-by: mxuax <mxuax@connect.ust.hk>
Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Platform] Add supports_torch_inductor interface (#1108)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [BugFix] Fix Qwen3 Omni talker mtp torch.compile startup error (#1104)

Signed-off-by: ram16g <anlianfengjie@163.com>
Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>
Co-authored-by: ram16g <anlianfengjie@163.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Bugfix] fix request_id of image generation in api server (#1112)

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Perf]: CFG parallel abstraction (#851)

Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [BugFix] Fix Qwen3 TTS 0.6B profile run hang (#995) (#1082)

* [CI] [ROCm] Quick fix amd ci (#1116)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

* [Bugfix] fix benchmark audio timing error and add benchmark test (#1109)

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Bugfix][Qwen3TTS] Load speaker_id/voices from model configuration (#1079)

Signed-off-by: pablo <juanz9312@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>

* [NPU] Align with GPUModelRunner (#1114)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [FEATURE] /v1/images/edit interface (#1101)

Signed-off-by: dengyunyang <584797741@qq.com>

* [Bugfix] Fix NPU SDPA attention mask shape and semantics (#1031)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: muziyuhui666 <111362884+muziyuhui666@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [TeaCache]: Add Coefficient Estimation (#940)

Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [CI]: Bagel E2E Smoked Test (#1074)

Signed-off-by: princepride <wangzhipeng628@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Misc] Bump version to 0.14.0 (#1128)

Signed-off-by: Roger Wang <hey@rogerw.io>

* [Doc] First stable release of vLLM-Omni (#1129)

Signed-off-by: Roger Wang <hey@rogerw.io>

* [Misc] Align error handling with upstream vLLM v0.14.0 (#1122)

Signed-off-by: anna <lee.anna@navercorp.com>
Co-authored-by: anna <lee.anna@navercorp.com>

* [Feature] add Tensor Parallelism to LongCat-Image(-Edit) (#926)

Signed-off-by: Rustam Khadipash <16683750+hadipash@users.noreply.github.com>

* [CI] Temporarily remove slow tests. (#1143)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Co-authored-by: princepride <wangzhipeng628@gmail.com>

* [CI] Refactor test_sequence_parallel.py and add a warmup run for more accurate performance stat (#1165)

Signed-off-by: mxuax <mxuax@connect.ust.hk>
Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* Dev/rebase v0.15.0 (#1159)

Signed-off-by: Taichang Zhou <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>

* Docs update paper link (#1169)

Signed-off-by: hsliu <liuhongsheng4@huawei.com>
Signed-off-by: hsliu_ustc <hsliu_ustc@noreply.gitcode.com>
Co-authored-by: hsliu_ustc <hsliu_ustc@noreply.gitcode.com>

* [Debug] Clear Dockerfile.ci to accelerate build image (#1172)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Debug] Correct Unreasonable Long Timeout (#1175)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Doc]Fix - Align with repo. (#1176)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>

* [Bugfix][Qwen-Image-Edit] Add a warning log for none negative_prompt (#1170)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [Bugfix] fix qwen image oom (#1168)

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

* [Hardware] Disable compile of diffusion on XPU (#1148)

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

* [Doc] Fix vLLM version in user docs (#1179)

Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>

* [Refactor] Refactor async chunk and fix the shape mismatch issue (#1151)

Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>

* bugfix: /images/edits endpoint fails pipeline data format check (#1141)

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Perf] resolving prolonged `cudastreamsynchronize` execution in z image processing (#1105)

Signed-off-by: erfgss <97771661+erfgss@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [Bugfix] modify RTF use audio_e2e/audio_duration (#1157)

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>

* [Doc] Highlight paper & slides. (#1186)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>

* [chore] Remove zmq context initialize (#1187)

Signed-off-by: xiedeyantu <czjourney@163.com>

* [NPU] Update Dockerfile and docs for v0.14.0 (#671)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [Bugfix] E2E metric incorrect qwen3-omni with async chunk feature (#1018)

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <ljh_lbj@163.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Doc] opt doc (#1118)

Signed-off-by: David Chen <530634352@qq.com>

* [Bugfix] Fix tp+sp accuracy, incorrect process group mapping (#1178)

Signed-off-by: David Chen <530634352@qq.com>

* [Feature] Enable use_audio_in_video for Qwen 3 Omni Online (#1198)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Bugfix] async_chunk rebase v0.15.0 (#1195)

Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>

* [feature]: support flux cache_dit (#1145)

Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>

* [CI] Add CI branch coverage calculation,  fix statement coverage results and add log before test for buildkite  log group (#1120)

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>

* [Wan 2.2][Diffusion] Add TP Support (#964)

Signed-off-by: weichen <calvin_zhu0210@outlook.com>

* [Hardware] [Feat] Setup platform dependent package installation (#1046)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: PopSoda2002 <zhouhp.me@gmail.com>
Co-authored-by: gcanlin <canlinguosdu@gmail.com>

* [XPU] Fix XPU UTs for basic coverage (#1164)

Signed-off-by: Yan Ma <yan.ma@intel.com>

* [Test] Add BuildKite test-full script for full CI. (#867)

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>

* [Refactor] Reuse upstream Qwen3MoeSparseMoeBlock (#1202)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [Bugfix] Fix wan2.2 ti2v (#1221)

Signed-off-by: mxuax <mxuax@connect.ust.hk>
Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Bugfix] Fix '--max-generated-image-size' cli args type (#1249)

Signed-off-by: ApsarasX <apsarax@outlook.com>

* [Bugfix] Ensure seed=0 is correctly handled in image edit (#1248)

Signed-off-by: ApsarasX <apsarax@outlook.com>

* [Docs] Add example image download step to Image-To-Video examples (#1258)

Signed-off-by: lishunyang <lishunyang12@163.com>

* [Bugfix] Fix padding bug in 12Hz tokenizer ConvTranspose1d decode (#1241)

Signed-off-by: linyueqian <linyueqian@outlook.com>

* [bugfix] Fix multimodal_output property to check completion outputs where audio data is attached (#1203)

Signed-off-by: linyueqian <linyueqian@outlook.com>

* [Doc] Update QA relevant to quantization  (#1257)

Signed-off-by: lishunyang <lishunyang12@163.com>

* [Bugfix] Fix Doc link Rrror (#1263)

Signed-off-by: lishunyang <lishunyang12@163.com>

* Process-Scoped GPU Memory Accounting (#1204)

Signed-off-by: Divyansh Singhvi <divyanshsinghvi@gmail.com>

* [ComfyUI]: ComfyUI integration (#1113)

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

* fix: add diffusion offload args to OmniConfig group instead of serve_parser (#1271)

Signed-off-by: Chenguang ZHENG <645327136@qq.com>

* [Doc] Adding models/pipelines/features Tutorial (#1196)

Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: dongbo910220 <32610838+dongbo910220@users.noreply.github.com>

* [CI] Add env variable check for nightly CI  (#1281)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>

* [CI] Add pytest markers to current tests and update the doc. (#577)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Diffusion][Perf] Remove Redundant Communication Cost by Refining SP Hook Design (#1275)

Signed-off-by: mxuax <mxuax@connect.ust.hk>
Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>

* [Feature] Opt metrics structure (#891)

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <ljh_lbj@163.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Test] Add example test cases for omni online (#1086)

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: yenuo26 <410167048@qq.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [CI] Reduce the time for Diffusion Sequence Parallelism Test (#1283)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>

* [Model] Support HunyuanImage3 Diffusion Model in vllm-omni (#1085)

Signed-off-by: Semmer2 <semmer@live.cn>

* [Chore] Update copyright year. (#1256)

Signed-off-by: lishunyang <lishunyang12@163.com>

* [feature]: support Flux.1-dev CFG-Parallel (#1269)

* [Bugfix] Fix 'NoneType' AttributeError in stable-diffusion model detect (#1254)

Signed-off-by: Yan Ma <yan.ma@intel.com>

* [Doc] Update Qwen3-TTS docs for consistency with Omni examples (#1226)

Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Fix]Ensure HuggingFace downloads complete before initialization. (#1213)

Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [BugFix] Fixed the issue where ignore_eos was not working. (#1286)

Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>

* [Test] Add e2e tests for Qwen3-TTS speech endpoint (#1206)

Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

* [Feat]: support VAE patch parallelism (#756)

Signed-off-by: dongbo910220 <1275604947@qq.com>
Co-authored-by: hsliuustc0106 <liuhongsheng4@huawei.com>

* [CI] Disable Qwen3-TTS E2E Test in pipeline.yml (#1306)

Signed-off-by: Gao Han <hgaoaf@connect.ust.hk>

* [Misc] Add per-request generator_device to online image gen and edit (#1183)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [Bagel]: Support TP (#1293)

Signed-off-by: princepride <wangzhipeng628@gmail.com>

* [Bugfix] Fix image edit RoPE crash when explicit height/width are provided (#1265)

Signed-off-by: lishunyang <lishunyang12@163.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Doc] Sync (#1216)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>

* [Bugfix] fix precision issues of qwen3-omni when enable async_chunk without system prompt (#1288)

Signed-off-by: Rein Yang <ruiruyang2@gmail.com>

* [Debug] Add trigger to concurrent stage init (#1274)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Bugfix][Qwen3-TTS] Fix task type (#1317)

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>

* Unifying CLI Argument Naming Style (#1309)

Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>

* [Bugfix][Qwen3-TTS] Preserve original model ID in omni_snapshot_download (#1318)

* [CI] Run nightly tests. (#1333)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>

* [Feature]: FP8 Quantization Support for DiT  (#1034)

Signed-off-by: lishunyang <lishunyang12@163.com>
Signed-off-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>

* Fix yield token metrics and opt metrics record stats (#1292)

* [Test] L2 & L3 Test Case Stratification Design for Omni Model (#1272)

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: yenuo26 <410167048@qq.com>
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Perf] Support Qwen3 Omni code2wav batch inference with async chunk (#1246)

Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: Ziming Huang <1520787127@qq.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update qwen3-omni & qwen2.5-omni openai client (#1304)

Signed-off-by: Rein Yang <ruiruyang2@gmail.com>

* [Feature] Support Wan2.2 T2V and I2V Online Serving with OpenAI /v1/videos API (#1073)

Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: Samit <285365963@qq.com>
Signed-off-by: SamitHuang <285365963@qq.com>
Co-authored-by: Flora Feng <4florafeng@gmail.com>

* [Feature] add Tensor Parallelism to SD_3.5 (#1336)

Signed-off-by: GG-li <3226868735@qq.com>

* [Feature]async scheduling to overlap chunk IO and compute (#951)

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
Co-authored-by: Bhanu068 <voutharoja.bhanu06@gmail.com>
Co-authored-by: Gao Han <gaohan19@huawei.com>

* [Bugfix] reused metrics to modify the API Server token statistics in Stream Response (#1301)

Signed-off-by: John Liu BUAA <liukecheng97@gmail.com>

* Refactor CPU Offloading Backend Pattern (#1223)

Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>
Signed-off-by: Samit <285365963@qq.com>
Co-authored-by: Samit <285365963@qq.com>

* [DOC] Doc for CI test - Details about five-level structure and some other files. (#1167)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>
Co-authored-by: yenuo26 <410167048@qq.com>

* [Bugfix] remove Tongyi-MAI/Z-Image-Turbo related test from L2 ci (#1348)

Signed-off-by: dengyunyang <584797741@qq.com>

* [Misc] wechat image update (#1354)

Signed-off-by: David Chen <530634352@qq.com>

* [Misc] Support WorkerWrapperBase and CustomPipeline for Diffusion Worker (#764)

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

* [Feature][Bugfix] Add CFG feature to Bagel (#1310)

Signed-off-by: Ding Zuhao <e1583181@u.nus.edu>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>

* [Feature]: Diffusion sleep to use process level memory calculation (#1276)

Signed-off-by: Divyansh Singhvi <divyanshsinghvi@gmail.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Signed-off-by: dsinghvi <divyanshsinghvi@gmail.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>

* change qwen3-omni open cudagraph by default (#1352)

Signed-off-by: Rein Yang <ruiruyang2@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [XPU] Update Bagel's flash_attn_varlen_func to fa utils (#1295)

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

* [Test] Add Omni Model Performance Benchmark Test (#1321)

Signed-off-by: yenuo26 <410167048@qq.com>
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>

* [BugFix]: Revert utils change (#1369)

Signed-off-by: princepride <wangzhipeng628@gmail.com>

* [Rebase] Rebase to vllm v0.16.0 (#1357)

Signed-off-by: Taichang Zhou <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Isotr0py <Isotr0py@outlook.com>
Co-authored-by: ZJY0516 <zhu.jiangyun@foxmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>

* [Test] Fix expansion and example test case for qwen3-omni (#1358)

Signed-off-by: yenuo26 <410167048@qq.com>

* [v0.16.0][BUG FIX]Fix hunyuan MOE after update to 0.16.0 (#1401)

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [0.16.0] remove cuda hard-code for Hunyuan Image3 (#1402)

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [XPU] Add XPU Dockerfile and related docs (#1162)

Signed-off-by: Yan Ma <yan.ma@intel.com>
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
Co-authored-by: Daniel Huang <daniel1.huang@intel.com>

* [Bugfix] Fix Hardcoded Datatypes in Z-image (#1393)

Signed-off-by: Alex Brooks <albrooks@redhat.com>

* [Feature] : Support disaggregated inference pipeline for Qwen3_TTS (#1161)

Signed-off-by: Sy03 <1370724210@qq.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Feature] Add automated PR reviewer bot with GLM integration (#1424)

Signed-off-by: hsliu <liuhongsheng4@huawei.com>
Signed-off-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* [Misc] Add Qwen2.5-Omni-3B model support to Gradio demo (#1382)

Signed-off-by: UsamaKenway <usamakenway@gmail.com>

* [misc] Feature/pr reviewer auto trigger&update model (#1431)

Signed-off-by: hsliu <liuhongsheng4@huawei.com>
Signed-off-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Hunter Liu <hunter@liu.sh>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* Revert "[misc] Feature/pr reviewer auto trigger&update model" (#1432)

* [Doc] Update GPU installation commands (#1434)

* [ROCM] [CI] fix dockerfile.rocm to support nightly build and also fix amd ci v0.16.0rc1 (#1380)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

* [Feature][BAGEL] Combine multi-branch cfg into a single batch to accelerate inference. (#1429)

Signed-off-by: Ding Zuhao <e1583181@u.nus.edu>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>

* [Feat]: add ASCII art logo for vLLM-Omni  (#1430)

* [Bug] [Bagel] Fix kv transfer bug (#1437)

Signed-off-by: Ding Zuhao <e1583181@u.nus.edu>
Co-authored-by: Wang Zhipeng: princepride <wangzhipeng628@gmail.com>

* [CI] Set L2 & L3 tests running conditions. (#1344)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>

* [Feature] vLLM-Omni RDMA connector (#1019)

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

* [Minor][Refactor] Pass seq_token_counts explicitly (#1425)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Misc] Extend Diffusion Benchmark script to other backends (#875)

Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Feature] Support Stage Based Deployment CLI (#939)

Signed-off-by: wuhang <wuhang6@huawei.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: wuhang <whlbx@hotmail.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [Doc] Optimize vLLM-Omni metrics documentation (#1311)

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <ljh_lbj@163.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Bugfix]  Forward all vllm-omni serve command parameters to model (#985)

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <ljh_lbj@163.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Doc]: Add bagel single/multi node usage with mooncake document (#1450)

* [Qwen3TTS][Feat] Code2Wav batched decoding (#1426)

Signed-off-by: pablo <pablo@agigo.ai>
Co-authored-by: pablo <pablo@agigo.ai>

* [CI] Remove overwhelming debug log (#1463)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Misc] update wechat image (#1464)

Signed-off-by: David Chen <530634352@qq.com>

* [Doc] Refine Diffusion Tutorial Documents (#1305)

Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>

* [Bugfix] Robust Audio Data Handling in _create_audio_choice (#1222)

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>

* [Bugfix]: Fix merging updated additional information to ensure dict type (#1296)

Signed-off-by: Shijin Zhang <75300765+Dovis01@users.noreply.github.com>

* [Model]Add new nextstep_1(Diffusion) model(only T2I) (#612)

Signed-off-by: Dong Wang <dongw2019@gmail.com>
Signed-off-by: sniper35 <dongw2019@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Bugfix] Add TTS configuration options (#1177)

Signed-off-by: Yanick Schraner <yanick.schraner@bs.ch>

* [Debug] Multi-Request for Qwen 3 Omni use_audio_in_video (#1433)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Bugfix] Fix case-sensitive task_type matching in Qwen3TTSModelForGeneration (#1455)

Signed-off-by: Sangchun Ha <seomk9896@gmail.com>

* [BugFix] process request.num_cached_tokens if it equals to the initial value  (#1468)

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Co-authored-by: Gao Han <gaohan19@huawei.com>

* [Bugfix] Fix SDPA attention mask dtype and shape (Fix #857) (#1349)

Signed-off-by: jader <yjader@foxmail.com>

* [Test] Reduce Perf test case and fix modify stage config (#1449)

Signed-off-by: yenuo26 <410167048@qq.com>

* [NPU] Upgrade to v0.16.0 (#1375)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [CI] Update Dockerfile for vllm-omni CI image and remove obsolete dep… (#1491)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Fix][Chore] Qwen3-TTS Modeling Minor Code Sanity Improvements (#1482)

Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>

* [Bugfix] Fix tuple/list KV cache extraction crash (#1405)

Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Doc] format lora related docs for the user's end (#1009)

Signed-off-by: AndyZhou952 <jzhoubc@connect.ust.hk>
Signed-off-by: Andy Zhou <46011930+AndyZhou952@users.noreply.github.com>

* [Feature] Support Wan2.2 output with irregular shapes (#1279)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [Misc] Migrate L1 tests to use pytest-mock (#1315)

Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>
Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>

* [Bugfix] Fix LoRA Scaling on Active Adapters (#1421)

Signed-off-by: Alex Brooks <albrooks@redhat.com>

* [Bugfix] fix record audio generated frame in offline infer (#1312)

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <ljh_lbj@163.com>

* [Model] Support OmniGen2 (#513)

Signed-off-by: Yupu <feng.yu.pu0330@gmail.com>

* [Bugfix][Qwen3TTS] (#1289)

Signed-off-by: pablo <juanz9312@gmail.com>
Co-authored-by: Gao Han <gaohan19@huawei.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* Use pull through cache image for H100 pool (#1518)

Signed-off-by: Kevin H. Luu <khluu000@gmail.com>

* [ROCm] [CI] [Docker] Point to use the latest vLLM v0.16.0 stable version (#1500)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

* [Bugfix] fix offline text_to_image error from #1009 (#1515)

Signed-off-by: David Chen <530634352@qq.com>

* [XPU] Enable FLASH_ATTN on XPU (#1332)

Signed-off-by: Yan Ma <yan.ma@intel.com>

* Revert gpu_1 job to use regular image (#1521)

Signed-off-by: Kevin H. Luu <khluu000@gmail.com>

* [Chore] remove unused logger in omni_diffusion (#531) (#1509)

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Co-authored-by: Gao Han <gaohan19@huawei.com>

* [Qwen3TTS][Feat] Streaming output (#1438)

Signed-off-by: pablo <pablo@agigo.ai>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: pablo <pablo@agigo.ai>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Bugfix] Race condition in MultiprocExecutor on concurrent access to Scheduler (#1448)

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Doc][Test][Misc] ComfyUI test, more screenshot, and code cleaning (#1435)

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Signed-off-by: Samit <285365963@qq.com>
Co-authored-by: Samit <285365963@qq.com>

* [Performance]Qwen3-Omni performance optimization (#1378)

Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>

* [Feature] Support HSDP for diffusion models (#1339)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [CI] fixed CI timeout (#1460)

Signed-off-by: zhumingjue <zhumingjue@huawei.com>
Signed-off-by: zhumingjue138 <zhumingjue@huawei.com>

* [Bugfix] Use uds for zmq address if not set --stage-id (#1522)

Signed-off-by: wuhang <wuhang6@huawei.com>

* [BugFix] Restore talker's config (#1524)

Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Canlin Guo <961750412@qq.com>

* [XPU] fix qwen_omni after rebase to v0.16.0 (#1416)

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Platform] Enable layerwise offload on all hardware (#1492)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* diffusion: enable VAE patch parallel for SD3.5 (#1428)

Signed-off-by: dongbo910220 <1275604947@qq.com>

* [Perf] GLM Image (#920)

Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: Jared Wen <w13431838023@gmail.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [skip ci][Doc] add design docs for async chunk in qwen3-omni (#962)

Signed-off-by: Rein Yang <ruiruyang2@gmail.com>

* feat(qwen3-tts): Add CUDA Graph support for speech tokenizer decoder (#1205)

Signed-off-by: xulusjb <fdukeshik@gmail.com>
Co-authored-by: xulusjb <fdukeshik@gmail.com>

* [New Model]: XiaomiMiMo/MiMo-Audio-7B-Instruct support (#750)

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: 齐保元 <qibaoyuan@xiaomi.com>
Signed-off-by: hsliu <liuhongsheng4@huawei.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: GG-li <3226868735@qq.com>
Signed-off-by: Sihao Li <111170255+GG-li@users.noreply.github.com>
Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Signed-off-by: mxuax <mxuax@connect.ust.hk>
Signed-off-by: Baoyuan Qi <qibaoyuan@126.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
Signed-off-by: dongbo910220 <1275604947@qq.com>
Signed-off-by: dongbo910220 <32610838+dongbo910220@users.noreply.github.com>
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: baoyuan qi <qibaoyuan@126.com>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: Prajwal A <prajwalanagani@gmail.com>
Signed-off-by: Shijin Zhang <75300765+Dovis01@users.noreply.github.com>
Signed-off-by: 丁宁 <nndding@gmail.com>
Signed-off-by: SHIJIN ZHANG <75300765+Dovis01@users.noreply.github.com>
Signed-off-by: dingning <dingning7@xiaomi.com>
Signed-off-by: dingning <dingning7@xiaomi.com>
Signed-off-by: dingning <dingning@xiaomi.com>
Co-authored-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: Zhang Shijin <zhangshijin@xiaomi.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Sihao Li <111170255+GG-li@users.noreply.github.com>
Co-authored-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Co-authored-by: Canlin Guo <canlinguosdu@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: JohnJan <wuzhongjian_yewu@cmss.chinamobile.com>
Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
Co-authored-by: dongbo910220 <32610838+dongbo910220@users.noreply.github.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: Junhong Liu <ljh_lbj@163.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: shijin zhang <zsj1364226740@gmail.com>
Co-authored-by: Zhou Taichang <tzhouam@connect.ust.hk>
Co-authored-by: root <root@hk01dgx028.cm.cluster>
Co-authored-by: Prajwal A <34590600+LawJarp-A@users.noreply.github.com>
Co-authored-by: Shijin Zhang <75300765+Dovis01@users.noreply.github.com>
Co-authored-by: dingning <dingning7@xiaomi.com>
Co-authored-by: ning ding <nndding@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [Feature]: Native GGUF Quantization Support for DiT (#1285)

Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* Add benchmark for `v1/audio/speech` non-streaming (#1408)

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Version] Auto generate version using `setuptools_scm` (#1224)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

* [Feat] : Support Async chunk cleanup (#1087)

Signed-off-by: Sy03 <1370724210@qq.com>

* [Profiler] Support online profiling (#1136)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: Canlin Guo <961750412@qq.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>

* [Bugfix] Fix redundant finished req status updating on OmniGenerationScheduler (#1510)

Signed-off-by: shijin zhang <75300765+Dovis01@users.noreply.github.com>
Co-authored-by: 齐保元 <qibaoyuan@xiaomi.com>

* [XPU][NPU][ROCM] enable cpu_offloading flag for non_cuda (#1488)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Co-authored-by: gcanlin <canlinguosdu@gmail.com>

* [Chore] Cleanup dead code in GGUF DiT code path (#1533)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

* [Doc] Update installation instructions for vllm 0.16.0 (#1505)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Doc] [skip ci]Sync. (#1363)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>
Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

* [CI][skip ci]Update H100 image link based on #1518 (#1538)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>

* Fix no embed text spk tokens (#1540)

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>

* [Debug] Merge vllm pull 35368 (#1534)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Docs] update async chunk docs diagram [skip ci] (#1530)

Signed-off-by: Rein Yang <ruiruyang2@gmail.com>

* fix(qwen3-tts): fix Base ICL voice clone producing corrupted audio (#1554)

Signed-off-by: linyueqian <linyueqian@outlook.com>

* [NPU][Bugfix] Align GPU side and recover qwen3-tts (#1564)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [BugFix] Fix unexpected crash when init OmniDiffusion (#1562)

Signed-off-by: Semmer2 <semmer@live.cn>

* [CI] Modify some CI test cases to run on L4 environment to reduce H100 resource usage. (#1543)

Signed-off-by: yenuo26 <410167048@qq.com>
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>

* [BugFix]: fix a lot of bugs (#1565)

Signed-off-by: princepride <wangzhipeng628@gmail.com>

* feat: add HyperCLOVAX-SEED-Omni-8B support

Model files:
- vllm_omni/diffusion/models/hyperclovax_vision/: vision decoder pipeline
  (HyperCLOVAXVisionPipeline) using flow matching diffusion + VisionTransformer
- vllm_omni/diffusion/models/hyperclovax_audio/: audio decoder pipeline
  (HyperCLOVAXAudioPipeline) using Unit-BigVGAN codec
- vllm_omni/model_executor/stage_input_processors/hyperclovax_seed_omni.py:
  thinker2vision_decoder and thinker2audio_decoder — extract discrete tokens from
  LLM output; truncate/pad vision codes to 729 (27x27) for decoder
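
The truncate/pad step for vision codes can be sketched as follows. This is only an illustration, not the code in hyperclovax_seed_omni.py: the function name `fit_vision_codes` and the pad value `0` are assumptions, since the commit message states only the target length of 729 (27x27).

```python
from typing import List

VISION_CODE_LEN = 27 * 27  # 729 codes, as stated in the commit message
PAD_CODE = 0  # assumption: the actual pad value is not given in the commit


def fit_vision_codes(
    codes: List[int],
    length: int = VISION_CODE_LEN,
    pad: int = PAD_CODE,
) -> List[int]:
    """Truncate or right-pad a list of discrete codes to exactly `length`."""
    if len(codes) >= length:
        return list(codes[:length])
    return list(codes) + [pad] * (length - len(codes))
```

A fixed length keeps the decoder input reshapeable into a 27x27 grid regardless of how many vision tokens the LLM emitted.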

Registry:
- vllm_omni/diffusion/registry.py: register HyperCLOVAXVisionPipeline and
  HyperCLOVAXAudioPipeline with post-process functions

Stage config:
- vllm_omni/model_executor/stage_configs/hcx_omni.yaml: 3-stage config
  Stage 0: LLM thinker (TP=4, GPUs 0-3), Stage 1: vision decoder (GPU 4),
  Stage 2: audio decoder (GPU 5)

Bug fixes for HyperCLOVAX compatibility:
- diffusion/request.py: add extra dict field to OmniDiffusionRequest so
  vision_tokens/audio_tokens from stage input processors reach the pipeline
- entrypoints/async_omni_diffusion.py: extract OmniTokensPrompt.additional_information
  into OmniDiffusionRequest.extra before creating request
- entrypoints/omni_stage.py: skip empty engine inputs (text-only requests where
  thinker2vision_decoder/thinker2audio_decoder return [])
- entrypoints/async_omni.py: handle skipped sentinel in _process_single_result
  so text-only requests complete without crashing on Stage 1/2

* fix: correct decoder params and HCX porting fixes

- hcx_omni.yaml: guidance_scale 3.5→0.75, num_inference_steps 30→50
  (matches OmniServe production defaults; 3.5 caused over-amplified
  autoguidance → shrunken/degraded output images)
- omni_stage.py: skip empty engine inputs for text-only requests
- async_omni_diffusion.py: extract OmniTokensPrompt.additional_information
  into OmniDiffusionRequest.extra (audio_tokens/vision_tokens)
- registry.py: HCX Omni diffusion model registration fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: HyperCLOVAX-SEED-Omni-8B stage pipeline and entrypoint fixes

* fix: change guidance_scale from 9.0 to 0.75 (autoguidance scale, OmniServe default)

* feat: add audio decoder Stage 2 to hcx_omni pipeline

- Wire HyperCLOVAXAudioPipeline as Stage 2 in hcx_omni.yaml
- GPU 5 assigned for audio decoder (Unit-BigVGAN / NCCosybigvganDecoder)
- Add runtime edge 0->2 (thinker -> audio decoder)
- Implement post-generation PCM chunk streaming for audio output
  (4800 samples / 200ms per SSE event @ 24kHz, int16 base64-encoded)
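
A minimal sketch of the chunking described above, assuming the audio is available as float samples in [-1, 1]. The helper name `pcm_chunks_b64` is hypothetical, and the SSE framing itself is omitted; only the 4800-sample chunk size, int16 format, and base64 encoding come from the commit message.

```python
import base64
import struct
from typing import Iterator, Sequence

SAMPLE_RATE = 24_000   # Hz, per the commit message
CHUNK_SAMPLES = 4_800  # 200 ms at 24 kHz


def pcm_chunks_b64(
    samples: Sequence[float],
    chunk_samples: int = CHUNK_SAMPLES,
) -> Iterator[str]:
    """Yield base64 strings, each holding up to `chunk_samples` int16 samples."""
    for start in range(0, len(samples), chunk_samples):
        chunk = samples[start:start + chunk_samples]
        # Clip to [-1, 1], then scale to the int16 range.
        ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in chunk]
        raw = struct.pack(f"<{len(ints)}h", *ints)  # little-endian int16
        yield base64.b64encode(raw).decode("ascii")
```

Each yielded string would become the payload of one streamed event; the final chunk may be shorter than 200 ms.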

Refs: github.com/vllm-project/vllm-omni/pull/869 (already incorporated)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: vllm version compatibility for HyperCLOVAX audio decoder startup

- config/model.py: try/except fallback for AttentionBackendEnum import
  (vllm.v1.attention.backends.registry absent in older vllm builds)
- pipeline_hyperclovax_audio.py: return actual named_parameters() from
  load_weights() when using MAR checkpoint so diffusers_loader strict
  check passes (weights loaded eagerly in __init__ via MAR extraction)
- qwen3_omni_moe_thinker.py, qwen2_5_omni_thinker.py: try/except stubs
  for check_interleaved_audio_video and merge_interleaved_embeddings
  which are absent in older vllm qwen2_5_omni_thinker; these symbols
  are only exercised by Qwen models, not HyperCLOVAX
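
The try/except import fallback pattern used in these fixes can be sketched generically. The helper `optional_import` below is hypothetical and is not the code in config/model.py; it only illustrates resolving a symbol that may be absent in older builds.

```python
import importlib
from typing import Any, Optional


def optional_import(module: str, attr: str, default: Optional[Any] = None) -> Any:
    """Return `module.attr` if it can be imported, otherwise `default`."""
    try:
        return getattr(importlib.import_module(module), attr)
    except (ImportError, AttributeError):
        # Module or symbol is missing in this build: fall back to the stub.
        return default


# Example with the path named in the commit message; yields None when the
# module is unavailable:
# AttentionBackendEnum = optional_import(
#     "vllm.v1.attention.backends.registry", "AttentionBackendEnum"
# )
```

Callers then need to check for the default before using the symbol, which is acceptable here because the stubbed symbols are only exercised by code paths that HyperCLOVAX does not take.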

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: add edge 1→2 and correct model key in hcx_omni.yaml Stage 2

- Add runtime edge from:1 to:2 (required for Stage-2 connector init;
  without it AsyncOrchestrator cannot route to audio decoder at runtime)
- Change model_subdir to model for Stage-2 engine_args to match
  total-poc working reference config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: audio S2S output - handle diffusion outputs in _create_audio_choice

HyperCLOVAXAudioPipeline (diffusion) stores audio in multimodal_output
directly (OmniRequestOutput.from_diffusion), not in outputs[0].multimodal_output
like LLM pipelines. The fix covers:

1. _create_audio_choice (non-streaming): use omni_outputs.multimodal_output
   when final_res.outputs is empty (diffusion path).
2. Streaming audio path: same fix for _final_res.outputs[0].
3. Both loops (for output in final_res.outputs): fall back to single
   synthetic choice at index 0 when outputs list is empty.
4. Handle bytes audio output from HyperCLOVAXAudioPipeline post-process
   (returns WAV bytes, not tensors like Qwen3-Omni).
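
The fallback between the LLM path and the diffusion path can be sketched as below. The dataclasses are simplified stand-ins, not the real OmniRequestOutput, and the `"audio"` key is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class CompletionOutput:  # hypothetical stand-in for one LLM completion
    multimodal_output: Dict[str, Any] = field(default_factory=dict)


@dataclass
class RequestOutput:  # hypothetical stand-in for OmniRequestOutput
    outputs: List[CompletionOutput] = field(default_factory=list)
    multimodal_output: Dict[str, Any] = field(default_factory=dict)


def get_audio_payload(res: RequestOutput) -> Optional[Any]:
    """Prefer per-completion audio; fall back to the request-level field."""
    if res.outputs:
        # LLM path: audio is attached to the first completion output.
        return res.outputs[0].multimodal_output.get("audio")
    # Diffusion path: outputs is empty, audio sits on the request itself.
    return res.multimodal_output.get("audio")
```

The returned payload may be a tensor or raw WAV bytes depending on the pipeline, so downstream code has to handle both types.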

Also fixes audio input (A2T) regression: skip diffusion prompt extraction
when mm_data has audio content (added in previous session).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: parse WAV bytes with soundfile for uniform PCM chunk streaming

HyperCLOVAXAudioPipeline returns WAV bytes including 44-byte header.
The previous byte-offset splitting included the header in the first
chunk, corrupting it. Fix: parse with soundfile to get float32 PCM,
then convert to int16 chunks uniformly regardless of source type
(bytes or tensor).

Verified: 27.04s of audio streamed correctly as 136 chunks (135 full 200ms chunks plus one partial chunk).
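
A sketch of the header-safe approach, using Python's standard `wave` module in place of soundfile so the example has no third-party dependency. It assumes mono 16-bit PCM at 24 kHz; the function names are hypothetical.

```python
import io
import struct
import wave
from typing import List


def wav_bytes_to_int16_chunks(wav_bytes: bytes, chunk_samples: int = 4800) -> List[bytes]:
    """Parse the WAV container, then split only the PCM payload into chunks."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        if wf.getsampwidth() != 2 or wf.getnchannels() != 1:
            raise ValueError("this sketch handles mono 16-bit PCM only")
        pcm = wf.readframes(wf.getnframes())  # header is not included
    step = chunk_samples * 2  # 2 bytes per int16 sample
    return [pcm[i:i + step] for i in range(0, len(pcm), step)]


def make_silent_wav(num_samples: int, rate: int = 24000) -> bytes:
    """Build an in-memory mono 16-bit WAV of silence, for demonstration."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(struct.pack(f"<{num_samples}h", *([0] * num_samples)))
    return buf.getvalue()
```

Slicing the raw file bytes at fixed offsets would place the header inside the first chunk, which is the corruption described above; parsing first avoids that.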

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: zero-shot TTS with speaker embedding from input audio

- serving_chat.py: extract last input_audio base64 from request messages
  and inject as ref_audio_b64 into engine_prompt dict
- thinker2audio_decoder: read ref_audio_b64 from prompt and pass as
  ref_audio_tokens to Stage 2 (HyperCLOVAXAudioPipeline)
- hcx_omni.yaml: switch Stage 2 to NCZSCosybigvganDecoder.mar (zero-shot)
  which uses ECAPA-TDNN speaker encoder instead of finetuned ID lookup

Pipeline: input audio -> ECAPA-TDNN -> speaker embedding -> BigVGAN synthesis
matching the voice characteristics of the original speaker.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: wire audio decoder Stage 2 to hcx_omni pipeline and fix S2S flow

- Add Stage 2 (HyperCLOVAXAudioPipeline / NCZSCosybigvganDecoder) to hcx_omni.yaml
  with GPU 5, gpu_memory_utilization 0.4, edge 0->2 from thinker
- Fix thinker2audio_decoder: correct audio token range (128606-135167),
  remap to [0, 6561) for BigVGAN input, handle empty token case gracefully
- Fix pipeline_hyperclovax_audio.py post_process_func signature and
  incorporate PR#869 BUG FIX patches for stable audio generation
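
The token remapping can be sketched as follows. The upper bound 135167 is treated as exclusive here, an assumption based on 135167 - 128606 = 6561 matching the stated target range [0, 6561); the function name is hypothetical.

```python
from typing import Iterable, List

AUDIO_TOKEN_START = 128606
AUDIO_TOKEN_END = 135167  # assumed exclusive: END - START == 6561


def extract_audio_codes(token_ids: Iterable[int]) -> List[int]:
    """Keep only audio tokens and shift them into the range [0, 6561)."""
    return [
        t - AUDIO_TOKEN_START
        for t in token_ids
        if AUDIO_TOKEN_START <= t < AUDIO_TOKEN_END
    ]
```

An empty result signals a text-only response, in which case the audio stage can be skipped rather than called with no input.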

* fix: use finetuned audio decoder and fix transformers_modules deserialization

- hcx_omni.yaml: switch Stage 2 from NCZSCosybigvganDecoder (zero-shot,
  ECAPA-TDNN) to NCCosybigvganDecoder (finetuned, nn.Embedding speaker id).
  Zero-shot decoder required ref_audio (mel spectrogram) which is unavailable
  for text-only requests and incompatible with finetuned decoder path.

- pipeline_hyperclovax_audio.py: guard ref_audio processing with
  'not self.bigvgan.finetune' — finetuned decoder has no ECAPA-TDNN encoder,
  so passing ref_audio bytes would crash with 'expected 100 channels'.

- omni_stage.py: add HuggingFace modules cache (~/.cache/huggingface/modules)
  to sys.path before queue.get_nowait() in try_collect(). Stage-0 pickles
  outputs containing custom classes from transformers_modules (trust_remote_code),
  but the API server process doesn't have this path, causing deserialization
  failures that silently drop Stage-0 outputs.
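
A minimal sketch of the sys.path adjustment, assuming the default cache location named above. The helper name is hypothetical, and the real change lives in try_collect() in omni_stage.py.

```python
import os
import sys
from typing import Optional


def ensure_hf_modules_on_path(cache_dir: Optional[str] = None) -> str:
    """Add the HuggingFace modules cache to sys.path once, so that pickled
    objects referencing transformers_modules classes can be imported."""
    path = os.path.expanduser(cache_dir or "~/.cache/huggingface/modules")
    if path not in sys.path:
        sys.path.append(path)
    return path
```

This has to run before any unpickling in the receiving process, because pickle resolves classes by importing their module path at load time.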

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: restore zero-shot speaker cloning with fallback for text-only requests

- hcx_omni.yaml: revert to NCZSCosybigvganDecoder.mar (zero-shot ECAPA-TDNN)
  for voice-preserving S2S synthesis. NCCosybigvganDecoder used a fixed
  integer speaker_id and lost the input speaker's voice.

- pipeline_hyperclovax_audio.py: add zero-mel fallback branch for
  finetune=False + ref_audio=None case. When a text-only request arrives
  (no input audio → no ref_audio), ECAPA-TDNN receives a zero mel tensor
  [1, num_mels, 64] instead of crashing with 'expected 100 channels'.
  S2S requests always have ref_audio so the zero-shot cloning path is
  unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: add stage config yaml for HCX audio decoder

Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>

* feat: add HyperCLOVAX-SEED-Omni 8B model as vllm-omni executor

Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>

* feat: add HCX audio decoder pipeline

Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>

* fix: modify exception for HCX audio decoder (GAN)

Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>

* fix: default temperature set to 0, and pipeline model evaluation mode

Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>

---------

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Signed-off-by: dengyunyang <584797741@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: Samit <285365963@qq.com>
Signed-off-by: lishunyang <lishunyang12@163.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: Kyle Huang <yellowsea@gmail.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: natureofnature <wzliu@connect.hku.hk>
Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: mxuax <mxuax@connect.ust.hk>
Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>
Signed-off-by: Divyansh Singhvi <divyanshsinghvi@gmail.com>
Signed-off-by: Lin, Fanli <fanli.lin@intel.com>
Signed-off-by: Fanli Lin <fanli.lin@intel.com>
Signed-off-by: Fanli Lin <fanli0116@gmail.com>
Signed-off-by: dongbo910220 <1275604947@qq.com>
Signed-off-by: Ding Zuhao <e1583181@u.nus.edu>
Signed-off-by: jzz <e1583181@u.nus.edu>
Signed-off-by: Andy Zhou <46011930+AndyZhou952@users.noreply.github.com>
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Signed-off-by: Pierre Le Guen <26087574+PierreLeGuen@users.noreply.github.com>
Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
Signed-off-by: ram16g <anlianfengjie@163.com>
Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Signed-off-by: pablo <juanz9312@gmail.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: anna <lee.anna@navercorp.com>
Signed-off-by: Rustam Khadipash <16683750+hadipash@users.noreply.github.com>
Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>
Signed-off-by: Taichang Zhou <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: hsliu <liuhongsheng4@huawei.com>
Signed-off-by: hsliu_ustc <hsliu_ustc@noreply.gitcode.com>
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>
Signed-off-by: erfgss <97771661+erfgss@users.noreply.github.com>
Signed-off-by: xiedeyantu <czjourney@163.com>
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <ljh_lbj@163.com>
Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: weichen <calvin_zhu0210@outlook.com>
Signed-off-by: Yan Ma <yan.ma@intel.com>
Signed-off-by: ApsarasX <apsarax@outlook.com>
Signed-off-by: Chenguang ZHENG <645327136@qq.com>
Signed-off-by: yenuo26 <410167048@qq.com>
Signed-off-by: Semmer2 <semmer@live.cn>
Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>
Signed-off-by: Gao Han <hgaoaf@connect.ust.hk>
Signed-off-by: Rein Yang <ruiruyang2@gmail.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>
Signed-off-by: Ziming Huang <1520787127@qq.com>
Signed-off-by: SamitHuang <285365963@qq.com>
Signed-off-by: GG-li <3226868735@qq.com>
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
Signed-off-by: John Liu BUAA <liukecheng97@gmail.com>
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Signed-off-by: dsinghvi <divyanshsinghvi@gmail.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
Signed-off-by: Alex Brooks <albrooks@redhat.com>
Signed-off-by: Sy03 <1370724210@qq.com>
Signed-off-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: UsamaKenway <usamakenway@gmail.com>
Signed-off-by: Hunter Liu <hunter@liu.sh>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: wuhang <wuhang6@huawei.com>
Signed-off-by: wuhang <whlbx@hotmail.com>
Signed-off-by: pablo <pablo@agigo.ai>
Signed-off-by: Shijin Zhang <75300765+Dovis01@users.noreply.github.com>
Signed-off-by: Dong Wang <dongw2019@gmail.com>
Signed-off-by: sniper35 <dongw2019@gmail.com>
Signed-off-by: Yanick Schraner <yanick.schraner@bs.ch>
Signed-off-by: Sangchun Ha <seomk9896@gmail.com>
Signed-off-by: jader <yjader@foxmail.com>
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
Signed-off-by: AndyZhou952 <jzhoubc@connect.ust.hk>
Signed-off-by: Yupu <feng.yu.pu0330@gmail.com>
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
Signed-off-by: zhumingjue <zhumingjue@huawei.com>
Signed-off-by: zhumingjue138 <zhumingjue@huawei.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: Jared Wen <w13431838023@gmail.com>
Signed-off-by: xulusjb <fdukeshik@gmail.com>
Signed-off-by: 齐保元 <qibaoyuan@xiaomi.com>
Signed-off-by: Sihao Li <111170255+GG-li@users.noreply.github.com>
Signed-off-by: Baoyuan Qi <qibaoyuan@126.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
Signed-off-by: dongbo910220 <32610838+dongbo910220@users.noreply.github.com>
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Signed-off-by: baoyuan qi <qibaoyuan@126.com>
Signed-off-by: Prajwal A <prajwalanagani@gmail.com>
Signed-off-by: 丁宁 <nndding@gmail.com>
Signed-off-by: SHIJIN ZHANG <75300765+Dovis01@users.noreply.github.com>
Signed-off-by: dingning <dingning7@xiaomi.com>
Signed-off-by: dingning <dingning7@xiaomi.com>
Signed-off-by: dingning <dingning@xiaomi.com>
Signed-off-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
Signed-off-by: Canlin Guo <961750412@qq.com>
Signed-off-by: shijin zhang <75300765+Dovis01@users.noreply.github.com>
Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>
Signed-off-by: Hyunjoon Jeong <with1015@unist.ac.kr>
Co-authored-by: Zeyu Huang | 黃澤宇 <11222265+fhfuih@users.noreply.github.com>
Co-authored-by: JohnJan <wuzhongjian_yewu@cmss.chinamobile.com>
Co-authored-by: dengyunyang <584797741@qq.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Canlin Guo <canlinguosdu@gmail.com>
Co-authored-by: Samit <285365963@qq.com>
Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: kYLe <yellowsea@gmail.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: NATURE <wzliu@connect.hku.hk>
Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Co-authored-by: Zhou Taichang <tzhouam@connect.ust.hk>
Co-authored-by: root <root@hk01dgx028.cm.cluster>
Co-authored-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Co-authored-by: amy-why-3459 <wuhaiyan17@huawei.com>
Co-authored-by: Rein Yang <ruiruyang2@gmail.com>
Co-authored-by: Ziming Huang <hzm414167@alibaba-inc.com>
Co-authored-by: dsinghvi <divyanshsinghvi@gmail.com>
Co-authored-by: Fanli Lin <fanli.lin@intel.com>
Co-authored-by: dongbo910220 <32610838+dongbo910220@users.noreply.github.com>
Co-authored-by: Ding Zuhao <e1583181@u.nus.edu>
Co-authored-by: Andy Zhou <46011930+AndyZhou952@users.noreply.github.com>
Co-authored-by: Pierre LE GUEN <26087574+PierreLeGuen@users.noreply.github.com>
Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Co-authored-by: ram16g <anlianfengjie@163.com>
Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Co-authored-by: Markus / Mark <46672778+marksverdhei@users.noreply.github.com>
Co-authored-by: Juan Pablo Zuluaga <46724788+JuanPZuluaga@users.noreply.github.com>
Co-authored-by: muziyuhui666 <111362884+muziyuhui666@users.noreply.github.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: ceanna93 <fairyanna@naver.com>
Co-authored-by: anna <lee.anna@navercorp.com>
Co-authored-by: Rustam Khadipash <16683750+hadipash@users.noreply.github.com>
Co-authored-by: Alicia <115451386+congw729@users.noreply.github.com>
Co-authored-by: hsliu_ustc <hsliu_ustc@noreply.gitcode.com>
Co-authored-by: liuzhenwei <zhenweiliu@habana.ai>
Co-authored-by: erfgss <97771661+erfgss@users.noreply.github.com>
Co-authored-by: Jensen <czjourney@163.com>
Co-authored-by: Junhong Liu <ljh_lbj@163.com>
Co-authored-by: weichen <calvin_zhu0210@outlook.com>
Co-authored-by: PopSoda2002 <zhouhp.me@gmail.com>
Co-authored-by: Yan Ma <yan.ma@intel.com>
Co-authored-by: ApsarasX <apsarax@outlook.com>
Co-authored-by: Chenguang Zheng <645327136@qq.com>
Co-authored-by: Jiaping Wu <53215702+ElleElleWu@users.noreply.github.com>
Co-authored-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>
Co-authored-by: Gao Han <gaohan19@huawei.com>
Co-authored-by: rein yang <73573651+R2-Y@users.noreply.github.com>
Co-authored-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Flora Feng <4florafeng@gmail.com>
Co-authored-by: Sihao Li <111170255+GG-li@users.noreply.github.com>
Co-authored-by: ChenWenjing <54166744+Shirley125@users.noreply.github.com>
Co-authored-by: Bhanu068 <voutharoja.bhanu06@gmail.com>
Co-authored-by: John Liu BUAA <liukecheng97@gmail.com>
Co-authored-by: yenuo26 <410167048@qq.com>
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Co-authored-by: liuzhenwei <zhenwei.liu@intel.com>
Co-authored-by: Isotr0py <Isotr0py@outlook.com>
Co-authored-by: ZJY0516 <zhu.jiangyun@foxmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Daniel Huang <daniel1.huang@intel.com>
Co-authored-by: Alex Brooks <albrooks@redhat.com>
Co-authored-by: Sy03 <1370724210@qq.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: UsamaKenway <56207634+UsamaKenway@users.noreply.github.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: wuhang <wuhang6@huawei.com>
Co-authored-by: pablo <pablo@agigo.ai>
Co-authored-by: SHIJIN ZHANG <75300765+Dovis01@users.noreply.github.com>
Co-authored-by: Dong W <89223086+sniper35@users.noreply.github.com>
Co-authored-by: Yanick Schraner <yanick.schraner@gmail.com>
Co-authored-by: Sangchun Ha <seomk9896@naver.com>
Co-authored-by: 亦瑾 <76905040+yJader@users.noreply.github.com>
Co-authored-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
Co-authored-by: Yupu <feng.yu.pu0330@gmail.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
Co-authored-by: zhumingjue138 <zhumingjue@huawei.com>
Co-authored-by: Canlin Guo <961750412@qq.com>
Co-authored-by: Jared Wen <w13431838023@gmail.com>
Co-authored-by: Xu Lu <572605156@qq.com>
Co-authored-by: xulusjb <fdukeshik@gmail.com>
Co-authored-by: Baoyuan Qi <qibaoyuan@xiaomi.com>
Co-authored-by: Zhang Shijin <zhangshijin@xiaomi.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: shijin zhang <zsj1364226740@gmail.com>
Co-authored-by: Prajwal A <34590600+LawJarp-A@users.noreply.github.com>
Co-authored-by: dingning <dingning7@xiaomi.com>
Co-authored-by: ning ding <nndding@gmail.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
Co-authored-by: Ting FU <futing10@huawei.com>
Co-authored-by: developer-account <irteam@vllm-omni-dev-0.vllm-omni-dev.p-nb13557.svc.cluster.local>
Co-authored-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>