Skip to content

[Profiler] Support online profiling#1136

Merged
hsliuustc0106 merged 15 commits into
vllm-project:mainfrom
gcanlin:profiler-api
Feb 27, 2026
Merged

[Profiler] Support online profiling#1136
hsliuustc0106 merged 15 commits into
vllm-project:mainfrom
gcanlin:profiler-api

Conversation

@gcanlin
Copy link
Copy Markdown
Collaborator

@gcanlin gcanlin commented Feb 1, 2026

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.

Purpose

  • Add start_profile() and stop_profile() async methods that delegate to the underlying DiffusionEngine via thread pool executor
  • Add /start_profile and /stop_profile POST endpoints to the vLLM-Omni API router
  • Endpoints are always available without requiring profiler_config to be set
  • Profiler endpoints are only available when enabling profiler_config or profiler env.

Test Plan

import base64
from openai import OpenAI
import os
client = OpenAI(
    api_key="None",
    base_url="http://localhost:8299/v1"
)

import requests
requests.post("http://localhost:8299/start_profile")

result = client.images.edit(
    model="Qwen/Qwen-Image-Edit-2511",
    image=[
        open("./cp-1.png", "rb"),
        open("./cp-2.png", "rb"),
    ],
    prompt="根据这图1中女性和图2中的男性,生成一组结婚照,并遵循以下描述:新郎穿着红色的中式马褂,新娘穿着精致的秀禾服,头戴金色凤冠。他们并肩站立在古老的朱红色宫墙前,背景是雕花的木窗。光线明亮柔和,构图对称,氛围喜庆而庄重。",
    size='1100x2200',
    stream=False,
    output_format='jpeg',
    extra_body={
        "num_inference_steps": 1,
        "guidance_scale": 0,
    }
)

requests.post("http://localhost:8299/stop_profile")

image_base64 = result.data[0].b64_json
image_bytes = base64.b64decode(image_base64)

# Save the image to a file
with open("cp.jpeg", "wb") as f:
    f.write(image_bytes)

Test Result

[Stage-0] INFO 02-01 15:21:14 [diffusion_engine.py:227] Starting diffusion profiling → /root/vllm-workspace/vllm-omni/profiles/stage_0_diffusion_1769959274*.json
[Stage-0] INFO 02-01 15:21:14 [torch_profiler.py:49] [Rank 0] Starting End-to-End Torch profiler
/root/vllm-workspace/.venv/lib/python3.11/site-packages/torch/profiler/profiler.py:509: UserWarning: Profiler won't be using warmup, this can skew profiler results
  warn("Profiler won't be using warmup, this can skew profiler results")
/root/vllm-workspace/.venv/lib/python3.11/site-packages/torch/autograd/profiler.py:271: UserWarning: CUDA is not available, disabling CUDA profiling
  warn("CUDA is not available, disabling CUDA profiling")
[Stage-0] INFO 02-01 15:21:14 [omni_stage.py:1295] [Stage-0] Diffusion Torch profiler started
(APIServer pid=336873) WARNING 02-01 15:21:14 [api_server.py:1024] Model mismatch: request specifies 'Qwen/Qwen-Image-Edit-2511' but server is running 'unknown'. Using server model.
(APIServer pid=336873) INFO 02-01 15:21:14 [api_server.py:1100] Generating 1 image(s) 1100x2200
(APIServer pid=336873) INFO 02-01 15:21:14 [async_omni.py:319] [AsyncOrchestrator] Entering scheduling loop: stages=1, final_stage=0
....
.....
(APIServer pid=336873) INFO 02-01 15:21:21 [api_server.py:1111] Successfully generated 1 image(s)
(APIServer pid=336873) INFO:     127.0.0.1:58816 - "POST /v1/images/edits HTTP/1.1" 200 OK
(APIServer pid=336873) INFO 02-01 15:21:21 [api_server.py:1438] Stopping profiler...
(APIServer pid=336873) INFO 02-01 15:21:21 [omni.py:430] [AsyncOrchestrator] Requesting profile data collection from stage-0
(APIServer pid=336873) INFO 02-01 15:21:21 [omni_stage.py:219] [Stage-0] Sending PROFILER_STOP to worker...
[Stage-0] INFO 02-01 15:21:21 [diffusion_engine.py:252] Stopping diffusion profiling and collecting results...
[Stage-0] INFO 02-01 15:21:30 [torch_profiler.py:58] [Rank 0] Trace exported to /root/vllm-workspace/vllm-omni/profiles/stage_0_diffusion_1769959274_rank0.json
[Stage-0] INFO 02-01 15:21:30 [torch_profiler.py:62] [Rank 0] Triggered background compression for /root/vllm-workspace/vllm-omni/profiles/stage_0_diffusion_1769959274_rank0.json
[Stage-0] INFO 02-01 15:21:30 [diffusion_engine.py:277] [Rank 0] Final trace: /root/vllm-workspace/vllm-omni/profiles/stage_0_diffusion_1769959274_rank0.json.gz
[Stage-0] INFO 02-01 15:21:30 [diffusion_engine.py:297] Profiling stopped. Collected 1 trace file(s) from 1 rank(s). Final trace paths: /root/vllm-workspace/vllm-omni/profiles/stage_0_diffusion_1769959274_rank0.json.gz
[Stage-0] INFO 02-01 15:21:30 [omni_stage.py:1311] [Stage-0] Diffusion Torch profiler stopped
[Stage-0] INFO 02-01 15:21:30 [omni_stage.py:1313] Diffusion trace files: {'traces': ['/root/vllm-workspace/vllm-omni/profiles/stage_0_diffusion_1769959274_rank0.json.gz'], 'tables': []}
(APIServer pid=336873) WARNING 02-01 15:21:30 [omni.py:461] [AsyncOrchestrator] Stage-0 returned no table data
(APIServer pid=336873) INFO 02-01 15:21:30 [omni.py:474] [AsyncOrchestrator] Collected 1 trace(s) and 0 table(s)
(APIServer pid=336873) INFO 02-01 15:21:30 [api_server.py:1440] Profiler stopped.
(APIServer pid=336873) INFO:     127.0.0.1:58056 - "POST /stop_profile HTTP/1.1" 200 OK

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b40fff241b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +1435 to +1441
@router.post("/stop_profile")
async def stop_profile(raw_request: Request):
"""Stop profiling for the engine."""
logger.info("Stopping profiler...")
await raw_request.app.state.engine_client.stop_profile()
logger.info("Profiler stopped.")
return Response(status_code=200)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Return profiler results from stop endpoint

The /stop_profile handler discards the result of engine_client.stop_profile() and always returns an empty 200. However, Omni.stop_profile() returns a dict of trace/table paths (see vllm_omni/entrypoints/omni.py), which is the only way for remote callers to discover where the profiler output was written. As implemented, clients running against a remote API server cannot retrieve or even know the trace locations, which makes the new online profiling endpoint effectively unusable outside of log access. Consider returning that dict in the response (or at least the trace paths) so callers can fetch the artifacts.

Useful? React with 👍 / 👎.

@david6666666
Copy link
Copy Markdown
Collaborator

@ZJY0516 @lishunyang12 PTAL thx

Copy link
Copy Markdown
Collaborator

@lishunyang12 lishunyang12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your PR. I will review it now.

Copy link
Copy Markdown
Collaborator

@lishunyang12 lishunyang12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean, focused PR. A few issues to address — see inline comments.

Comment thread vllm_omni/entrypoints/openai/api_server.py Outdated
Comment thread vllm_omni/entrypoints/openai/api_server.py Outdated
Args:
trace_filename: Optional base filename for trace files.
If None, a timestamp-based name will be generated.
"""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asyncio.get_event_loop() is deprecated since Python 3.10. Since this is always called from within an async context, use asyncio.get_running_loop() instead.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's going to be consistent with other async API in the current codebase. Maybe we should consider leave it to a following-up PR, which replaces asyncio.get_event_loop() by asyncio.get_running_loop() for all.

await handle_profiler_task_async(task_type)
profiler_data = await handle_profiler_task_async(task_type)
# Send result back to orchestrator for STOP command
if task_type == OmniStageTaskType.PROFILER_STOP:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the orchestrator in omni.py set up to handle messages with {"type": "profiler_result"}? If the message dispatch loop doesn't recognize this type, these messages will sit in the queue and could interfere with normal request output routing. Please verify and add the handling on the receiving side if missing.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we don't need to add more handling method for it. It has checked if response.get("type") == "profiler_result" and extracts the data.

if stage_type == "diffusion":
try:
# Sync call is safe here — diffusion profiling is lightweight
profile_dir = os.environ.get("VLLM_TORCH_PROFILER_DIR", "./profiles")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this will conflict with PR #1098, which removes the trace_filename parameter from start_profile() and makes the profiler config-driven. If #1098 merges first, this trace_filename argument will be silently ignored. Please coordinate merge order.

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin
Copy link
Copy Markdown
Collaborator Author

gcanlin commented Feb 14, 2026

@lishunyang12 @hsliuustc0106 I think it's ready now. I will test omni models as well.

@hsliuustc0106
Copy link
Copy Markdown
Collaborator

@lishunyang12 @hsliuustc0106 I think it's ready now. I will test omni models as well.

yes i think we need to align diffusion and omni models in profiling

@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label Feb 14, 2026
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin
Copy link
Copy Markdown
Collaborator Author

gcanlin commented Feb 16, 2026

Test Qwen3-Omni profiler with the following config:

stage_args:
  - stage_id: 0
    stage_type: llm  # Use llm stage type to launch OmniLLM
    runtime:
      devices: "0,1"
      max_batch_size: 1
    engine_args:
      model_stage: thinker
      model_arch: Qwen3OmniMoeForConditionalGeneration
      worker_type: ar
      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
      gpu_memory_utilization: 0.5
      enforce_eager: true
      trust_remote_code: true
      engine_output_type: latent  # Output hidden states for talker
      distributed_executor_backend: "mp"
      enable_prefix_caching: false
      max_num_batched_tokens: 128
      hf_config_name: thinker_config
      tensor_parallel_size: 2
      profiler_config:
        profiler: torch
        torch_profiler_dir: ./test-omni
    final_output: true
    final_output_type: text
    is_comprehension: true
    default_sampling_params:
      temperature: 0.4
      top_p: 0.9
      top_k: 1
      max_tokens: 2048
      seed: 42
      detokenize: True
      repetition_penalty: 1.05

Server:

 [Stage-0] vLLM profiler started
(Worker pid=18095) [Stage-1] INFO 02-16 18:36:54 [mrope.py:345] Multimodal token idx changed!
(Worker pid=18102) [Stage-2] INFO 02-16 18:36:56 [mrope.py:345] Multimodal token idx changed!
(APIServer pid=12252) WARNING 02-16 18:36:56 [protocol.py:51] The following fields were present in the request but ignored: {'reasoning_content'}
(APIServer pid=12252) INFO:     127.0.0.1:50660 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=12252) INFO 02-16 18:36:56 [api_server.py:1545] Stopping profiler for stages: [0]
(APIServer pid=12252) INFO 02-16 18:36:56 [omni.py:449] [AsyncOrchestrator] Requesting profile data collection from stage-0
(APIServer pid=12252) INFO 02-16 18:36:56 [omni_stage.py:368] [Stage-0] Sending PROFILER_STOP to worker...
(EngineCore_DP0 pid=17415) [Stage-0] INFO 02-16 18:37:56 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=17415) [Stage-0] INFO 02-16 18:38:56 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=17415) [Stage-0] INFO 02-16 18:39:56 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=17415) [Stage-0] INFO 02-16 18:40:56 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker_TP0 pid=18098) -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
(Worker_TP0 pid=18098)                                                    Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
(Worker_TP0 pid=18098) -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
(Worker_TP0 pid=18098)                    execute_context_0(0)_generation_1(1)         0.00%       0.000us         0.00%       0.000us       0.000us        3.645s       121.68%        3.645s     140.198ms            26
(Worker_TP0 pid=18098)                   execute_context_1(12)_generation_0(0)         0.00%       0.000us         0.00%       0.000us       0.000us        2.813s        93.92%        2.813s        2.813s             1
....
(Worker_TP0 pid=18098)                                             aten::fill_         0.16%      10.324ms         0.28%      18.186ms      11.416us       2.641ms         0.09%       2.641ms       1.658us          1593
(Worker_TP0 pid=18098) void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.386ms         0.08%       2.386ms       1.733us          1377
(Worker_TP0 pid=18098)                                              aten::sort         0.01%     956.367us         0.04%       2.868ms     106.213us       1.798ms         0.06%       1.865ms      69.081us            27
(Worker_TP0 pid=18098)                                 Activity Buffer Request         0.08%       5.505ms         0.08%       5.507ms       1.377ms       1.425ms         0.05%       1.425ms     356.258us             4
(Worker_TP0 pid=18098) ## Call CompiledFxGraph f7augta4bnvjagednzipfktriod3...         0.00%       0.000us         0.00%       0.000us       0.000us       1.390ms         0.05%       1.390ms      53.475us            26
(Worker_TP0 pid=18098) void at_cuda_detail::cub::DeviceRadixSortOnesweepKer...         0.00%       0.000us         0.00%       0.000us       0.000us       1.206ms         0.04%       1.206ms      11.168us           108
(Worker_TP0 pid=18098)                                    cudaEventSynchronize         0.01%     518.398us         0.01%     520.637us       6.428us       1.102ms         0.04%       1.102ms      13.608us            81
(Worker_TP0 pid=18098)                     cudaThreadExchangeStreamCaptureMode         1.33%      86.375ms         1.33%      86.409ms       1.543ms       1.049ms         0.04%       1.052ms      18.787us            56
(Worker_TP0 pid=18098)                                         Memset (Device)         0.00%       0.000us         0.00%       0.000us       0.000us     337.215us         0.01%     337.215us       1.784us           189
(Worker_TP0 pid=18098)                                            aten::argmax         0.01%     438.476us         0.01%     736.349us      27.272us     315.938us         0.01%     315.938us      11.701us            27
(Worker_TP0 pid=18098)                                               aten::mul         0.03%       1.762ms         1.62%     105.088ms     890.578us     309.858us         0.01%     309.858us       2.626us           118
(Worker_TP0 pid=18098) void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us     309.858us         0.01%     309.858us       5.738us            54
(Worker_TP0 pid=18098) void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us     287.585us         0.01%     287.585us      10.651us            27
(Worker_TP0 pid=18098)                                           aten::scatter         0.01%     335.667us         0.01%     854.441us      31.646us     278.498us         0.01%     345.890us      12.811us            27
(Worker_TP0 pid=18098) void at::native::_scatter_gather_elementwise_kernel<...         0.00%       0.000us         0.00%       0.000us       0.000us     278.498us         0.01%     278.498us      10.315us            27
(Worker_TP0 pid=18098) void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us     223.426us         0.01%     223.426us       4.138us            54
(Worker_TP0 pid=18098)                                      aten::index_select         0.01%     856.108us         0.03%       1.902ms      23.479us     217.951us         0.01%     228.574us       2.822us            81
(Worker_TP0 pid=18098) void at::native::(anonymous namespace)::indexSelectS...         0.00%       0.000us         0.00%       0.000us       0.000us     217.951us         0.01%     217.951us       4.036us            54
(Worker_TP0 pid=18098) void at::native::vectorized_elementwise_kernel<2, at...         0.00%       0.000us         0.00%       0.000us       0.000us     211.330us         0.01%     211.330us       1.975us           107
(Worker_TP0 pid=18098)                                              aten::div_         0.01%     334.918us         0.01%     604.563us      11.196us     208.541us         0.01%     208.541us       3.862us            54
(Worker_TP0 pid=18098)                                            aten::cumsum         0.01%     474.328us         0.01%     774.223us      28.675us     185.347us         0.01%     185.347us       6.865us            27
(Worker_TP0 pid=18098) void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us     183.326us         0.01%     183.326us       3.395us            54
(Worker_TP0 pid=18098)                                               aten::add         0.03%       1.669ms         0.59%      38.081ms     369.721us     180.258us         0.01%     180.258us       1.750us           103
(Worker_TP0 pid=18098) void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us     180.258us         0.01%     180.258us       2.225us            81
....
(Worker_TP0 pid=18098)                                                aten::le         0.00%     191.929us         0.00%     301.941us      11.183us     109.763us         0.00%     109.763us       4.065us            27
(Worker_TP0 pid=18098) void at::native::elementwise_kernel<128, 2, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us      98.462us         0.00%      98.462us       3.647us            27
(Worker_TP0 pid=18098)                                   cudaStreamIsCapturing         0.03%       2.012ms         0.03%       2.012ms       0.690us      96.321us         0.00%      96.321us       0.033us          2916
(Worker_TP0 pid=18098)                         _C::apply_repetition_penalties_         0.00%     150.498us         2.61%     169.916ms       6.293ms      92.735us         0.00%     103.007us       3.815us            27
(Worker_TP0 pid=18098) void vllm::apply_repetition_penalties_kernel<float>(...         0.00%       0.000us         0.00%       0.000us       0.000us      92.735us         0.00%      92.735us       3.435us            27
(Worker_TP0 pid=18098) void vllm::rms_norm_kernel<c10::BFloat16, 8, 2>(c10:...         0.00%       0.000us         0.00%       0.000us       0.000us      89.314us         0.00%      89.314us       3.308us            27
(Worker_TP0 pid=18098)                                   Lazy Function Loading         0.01%     855.297us         0.01%     855.297us      50.312us      84.382us         0.00%      84.382us       4.964us            17
(Worker_TP0 pid=18098)                                            aten::gather         0.00%     292.156us         0.01%     442.666us      16.395us      78.434us         0.00%      78.434us       2.905us            27
(Worker_TP0 pid=18098) void at::native::_scatter_gather_elementwise_kernel<...         0.00%       0.000us         0.00%       0.000us       0.000us      78.434us         0.00%      78.434us       2.905us            27
(Worker_TP0 pid=18098) at::native::(anonymous namespace)::fill_reverse_indi...         0.00%       0.000us         0.00%       0.000us       0.000us      77.247us         0.00%      77.247us       2.861us            27
(Worker_TP0 pid=18098) void at::native::elementwise_kernel<128, 2, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us      71.038us         0.00%      71.038us       2.631us            27
(Worker_TP0 pid=18098) void at::native::vectorized_elementwise_kernel<2, at...         0.00%       0.000us         0.00%       0.000us       0.000us      64.605us         0.00%      64.605us       2.485us            26
(Worker_TP0 pid=18098) -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
(Worker_TP0 pid=18098) Self CPU time total: 6.503s
(Worker_TP0 pid=18098) Self CUDA time total: 2.996s
(Worker_TP0 pid=18098)
(Worker_TP0 pid=18098) [Stage-0] INFO 02-16 18:41:51 [wrapper.py:66] Profiler stopped successfully.
[Stage-0] INFO 02-16 18:41:51 [omni_stage.py:1187] [Stage-0] vLLM profiler stopped
(APIServer pid=12252) WARNING 02-16 18:41:51 [omni.py:480] [AsyncOrchestrator] Stage-0 returned no table data
(APIServer pid=12252) INFO 02-16 18:41:51 [omni.py:493] [AsyncOrchestrator] Collected 0 trace(s) and 0 table(s)
(APIServer pid=12252) INFO 02-16 18:41:51 [api_server.py:1552] Profiler stopped.
(APIServer pid=12252) INFO:     127.0.0.1:50662 - "POST /stop_profile HTTP/1.1" 200 OK

Client code:

#!/usr/bin/env python3
"""Script to profile a single request on Qwen3-Omni server.

Usage:
    python scripts/profile_request.py --host localhost --port 8000
"""

import argparse
import requests
import json


def start_profile(base_url: str, stages: list[int] | None = None) -> dict:
    """Start profiling for specified stages."""
    url = f"{base_url}/start_profile"
    payload = {"stages": stages} if stages else {}
    response = requests.post(url, json=payload if payload else None)
    response.raise_for_status()
    print(f"[+] Started profiling for stages: {stages if stages else 'all'}")
    return response.json() if response.text else {}


def stop_profile(base_url: str, stages: list[int] | None = None) -> dict:
    """Stop profiling for specified stages."""
    url = f"{base_url}/stop_profile"
    payload = {"stages": stages} if stages else {}
    response = requests.post(url, json=payload if payload else None)
    response.raise_for_status()
    print(f"[+] Stopped profiling for stages: {stages if stages else 'all'}")
    return response.json() if response.text else {}


def send_chat_request(base_url: str, message: str) -> dict:
    """Send a chat completion request."""
    url = f"{base_url}/v1/chat/completions"
    payload = {
        "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
        "messages": [
            {"role": "user", "content": message}
        ],
        "max_tokens": 256,
        "temperature": 0.7,
    }
    print(f"[*] Sending chat request: {message[:50]}...")
    response = requests.post(url, json=payload)
    response.raise_for_status()
    result = response.json()
    print(f"[+] Response received")
    return result


def main():
    parser = argparse.ArgumentParser(description="Profile a single request on Qwen3-Omni")
    parser.add_argument("--host", default="localhost", help="Server host")
    parser.add_argument("--port", type=int, default=8000, help="Server port")
    parser.add_argument("--stages", type=int, nargs="*", default=[0],
                        help="Stage IDs to profile (default: [0])")
    parser.add_argument("--message", default="Hello, how are you?",
                        help="Message to send for the request")
    args = parser.parse_args()

    base_url = f"http://{args.host}:{args.port}"
    stages = args.stages if args.stages else None

    print(f"[*] Connecting to {base_url}")
    print(f"[*] Profiling stages: {stages if stages else 'all'}")
    print("-" * 50)

    try:
        # 1. Start profiling
        start_result = start_profile(base_url, stages)
        if start_result:
            print(f"    Result: {json.dumps(start_result, indent=2)}")

        # 2. Send one request
        print("-" * 50)
        chat_result = send_chat_request(base_url, args.message)
        if "choices" in chat_result and chat_result["choices"]:
            content = chat_result["choices"][0].get("message", {}).get("content", "")
            print(f"    Assistant: {content[:200]}...")

        # 3. Stop profiling
        print("-" * 50)
        stop_result = stop_profile(base_url, stages)
        if stop_result:
            print(f"    Result: {json.dumps(stop_result, indent=2)}")

        print("-" * 50)
        print("[+] Profiling complete! Check ./test-omni for trace files.")

    except requests.exceptions.ConnectionError:
        print(f"[!] Error: Could not connect to {base_url}")
        print("    Make sure the server is running.")
    except requests.exceptions.HTTPError as e:
        print(f"[!] HTTP Error: {e}")
        print(f"    Response: {e.response.text if e.response else 'N/A'}")


if __name__ == "__main__":
    main()

Client log:

 python scripts/profile_request.py \
  --host localhost \
  --port 8091 \
  --stages 0 \
  --message "Tell me a joke"
[*] Connecting to http://localhost:8091
[*] Profiling stages: [0]
--------------------------------------------------
[+] Started profiling for stages: [0]
--------------------------------------------------
[*] Sending chat request: Tell me a joke...
[+] Response received
    Assistant: Sure! Here's one for you:

Why don't skeletons fight each other?

Because they don't have the guts! 😄...
--------------------------------------------------
[+] Stopped profiling for stages: [0]
--------------------------------------------------
[+] Profiling complete! Check ./test-omni for trace files.

@lishunyang12
Copy link
Copy Markdown
Collaborator

lishunyang12 commented Feb 20, 2026

LGTM, please solve the CI. @hsliuustc0106 We can merge this PR first before #1098.

@gcanlin
Copy link
Copy Markdown
Collaborator Author

gcanlin commented Feb 20, 2026

LGTM, please solve the CI. @hsliuustc0106 We can merge this PR first before #1098.

Thanks! The error of CI seems to be unrelated to this PR.

@hsliuustc0106 hsliuustc0106 removed the ready label to trigger buildkite CI label Feb 20, 2026
Copy link
Copy Markdown
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@gcanlin
Copy link
Copy Markdown
Collaborator Author

gcanlin commented Feb 23, 2026

@hsliuustc0106 May I ask if this PR is ready to be merged?

@hsliuustc0106
Copy link
Copy Markdown
Collaborator

@vllm-omni-reviewer

@github-actions
Copy link
Copy Markdown

🤖 VLLM-Omni PR Review

Code Review: [Profiler] Support online profiling for diffusion

1. Overview

This PR adds online profiling support for diffusion models in VLLM-Omni by:

  • Adding async start_profile() and stop_profile() methods to AsyncOmniDiffusion that delegate to the underlying engine via thread pool executor
  • Modifying handle_profiler_task_async in omni_stage.py to properly await diffusion profiling calls and return results
  • Adding /start_profile and /stop_profile POST endpoints to the API router with optional stage selection

Overall Assessment: Positive - The implementation is clean, follows existing patterns, and includes proper test verification.

2. Code Quality

Strengths

  • Proper use of asyncio.run_in_executor for delegating blocking calls to thread pool
  • Good type hints and docstrings
  • Consistent error handling with try/except blocks and logging

Issues Found

Potential Bug - Missing return in exception handlers:
vllm_omni/entrypoints/omni_stage.py:1168-1171

except Exception as e:
    logger.warning("[Stage-%s] Failed to start diffusion profiler: %s", stage_id, e)
# No return statement here - falls through to return {} at end

While this works due to the return {} at the end, it's cleaner to explicitly return in exception handlers for clarity.

Inconsistent async handling:
vllm_omni/entrypoints/omni_stage.py:1173-1176

The LLM profiler path doesn't use await while the diffusion path does. This is intentional but should be documented with a comment explaining why.

3. Architecture & Design

Strengths

  • Clean separation between API layer and engine layer
  • Proper use of async patterns for I/O-bound operations
  • Good backward compatibility consideration for non-Omni engines

Suggestions

Consider adding a profiler state check:
The endpoints don't check if profiling is already started/stopped. Calling start_profile twice could lead to unexpected behavior.

# In api_server.py, consider:
@router.post("/start_profile")
async def start_profile(raw_request: Request, request: ProfileRequest | None = None):
    # Could add state tracking to prevent double-start

4. Security & Safety

Observations

  • No authentication/authorization on profiling endpoints (consistent with existing API design)
  • No input validation on trace_filename parameter - could potentially be a path traversal concern if user-controllable

Recommendation: Consider validating or sanitizing the trace filename if it ever becomes user-controllable in the future.

5. Testing & Documentation

Strengths

  • Comprehensive test plan with working code example
  • Test results demonstrate successful profiling workflow
  • Clear log output showing the profiling lifecycle

Suggestions

  • Consider adding unit tests for the new async methods
  • The PR description checklist items are not checked - please update before merging

6. Specific Suggestions

vllm_omni/entrypoints/async_omni_diffusion.py:311-322

Consider adding type annotation for the executor parameter:

async def start_profile(self, trace_filename: str | None = None) -> None:
    """Start profiling for the diffusion model.

    Args:
        trace_filename: Optional base filename for trace files.
                       If None, a timestamp-based name will be generated.
    """
    loop = asyncio.get_running_loop()  # Prefer get_running_loop() over get_event_loop()
    await loop.run_in_executor(
        self._executor,
        self.engine.start_profile,
        trace_filename,
    )

vllm_omni/entrypoints/omni_stage.py:1153

Add explicit return in exception handlers for clarity:

async def handle_profiler_task_async(task_type: OmniStageTaskType) -> dict:
    """Handle profiler task asynchronously for both LLM and diffusion stages."""
    if task_type == OmniStageTaskType.PROFILER_START:
        if stage_type == "diffusion":
            try:
                profile_dir = os.environ.get("VLLM_TORCH_PROFILER_DIR", "./profiles")
                os.makedirs(profile_dir, exist_ok=True)
                trace_filename = f"stage_{stage_id}_diffusion_{int(time.time())}"
                await stage_engine.start_profile(trace_filename=trace_filename)
                logger.info("[Stage-%s] Diffusion Torch profiler started", stage_id)
            except Exception as e:
                logger.warning("[Stage-%s] Failed to start diffusion profiler: %s", stage_id, e)
            return {}  # Add explicit return here

vllm_omni/entrypoints/openai/api_server.py:1522-1524

The conditional logic can be simplified:

# Current:
if stages is not None:
    result = await engine_client.start_profile(stages=stages)
else:
    result = await engine_client.start_profile()

# Could be simplified if engine_client.start_profile accepts None:
result = await engine_client.start_profile(stages=stages)

7. Approval Status

LGTM with suggestions

The PR is well-implemented and tested. The suggestions above are minor improvements that don't block merging. The core functionality works correctly as demonstrated by the test results.

Before merging, please:

  1. Update the PR description checklist
  2. Consider the minor code quality suggestions above (especially using get_running_loop() instead of get_event_loop())

This review was generated automatically by the VLLM-Omni PR Reviewer Bot
using glm-5.

@hsliuustc0106
Copy link
Copy Markdown
Collaborator

@gcanlin check the glm comments

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin
Copy link
Copy Markdown
Collaborator Author

gcanlin commented Feb 23, 2026

@gcanlin check the glm comments

Accept the third suggestion from GLM.

@hsliuustc0106
Copy link
Copy Markdown
Collaborator

@vllm-omni-reviewer

@gcanlin gcanlin changed the title [Profiler] Support online profiling for diffusion [Profiler] Support online profiling Feb 24, 2026
@hsliuustc0106
Copy link
Copy Markdown
Collaborator

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin
Copy link
Copy Markdown
Collaborator Author

gcanlin commented Feb 24, 2026

Copy link
Copy Markdown
Contributor

@NickLucche NickLucche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Endpoints are always available without requiring profiler_config to be set

This is a big no-no for prod use-cases, as bad actors can exploit this endpoint to hog server. Something as naive as #1451 should be ok for now, given endpoint is geared towards dev.

Copy link
Copy Markdown
Contributor

@NickLucche NickLucche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just enforcing point above

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin
Copy link
Copy Markdown
Collaborator Author

gcanlin commented Feb 24, 2026

Endpoints are always available without requiring profiler_config to be set

This is a big no-no for prod use-cases, as bad actors can exploit this endpoint to hog server. Something as naive as #1451 should be ok for now, given endpoint is geared towards dev.

I add the new router for profiler. Just like vLLM's way:

def attach_router(app: FastAPI):
    profiler_config = getattr(app.state.args, "profiler_config", None)
    assert profiler_config is None or isinstance(profiler_config, ProfilerConfig)
    if profiler_config is not None and profiler_config.profiler is not None:
        logger.warning_once(
            "Profiler with mode '%s' is enabled in the "
            "API server. This should ONLY be used for local development!",
            profiler_config.profiler,
        )
        app.include_router(router)

#1261 will refactor the old env into CLI args. Before that, the check for profiler router has to be compatible with two ways.

Please review again. Thanks!

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Comment thread vllm_omni/entrypoints/openai/api_server.py
@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label Feb 25, 2026
return True

# TODO: remove this env after refactoring torch profiler to CLI args
env_value = os.environ.get("VLLM_TORCH_PROFILER_DIR")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The VLLM repository uses profiler-config. After 985 was merged, will there be any issues using the VLLM_TORCH_PROFILER_DIR environment?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I'm not really sure. For now, we have two profiler ways:

  1. Diffusion profiler owned by vLLM-Omni totally.(But it still use env).
  2. Reuse vLLM's profiler for omni pipeline models.(which has upgraded to vLLM's profiler config way).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vllm fully moved away from VLLM_TORCH_PROFILER_DIR in latest HEAD, I think we should just commit to one way if we're aiming to land this PR with current vLLM version.
ow we have to start with both VLLM_TORCH_PROFILER_DIR vllm serve .. --profiler-config

return True

# TODO: remove this env after refactoring torch profiler to CLI args
env_value = os.environ.get("VLLM_TORCH_PROFILER_DIR")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vllm fully moved away from VLLM_TORCH_PROFILER_DIR in latest HEAD, I think we should just commit to one way if we're aiming to land this PR with current vLLM version.
ow we have to start with both VLLM_TORCH_PROFILER_DIR vllm serve .. --profiler-config

Comment thread vllm_omni/entrypoints/openai/api_server.py Outdated
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
Signed-off-by: Canlin Guo <961750412@qq.com>
@gxxx-hum
Copy link
Copy Markdown
Contributor

Which version of vllm does this pr depend on?

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin
Copy link
Copy Markdown
Collaborator Author

gcanlin commented Feb 26, 2026

Which version of vllm does this pr depend on?

You can use it with vLLM 0.16.0

@gcanlin
Copy link
Copy Markdown
Collaborator Author

gcanlin commented Feb 26, 2026

@hsliuustc0106 This PR is ready to merge.

Copy link
Copy Markdown
Contributor

@Shirley125 Shirley125 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please update the document about how to use online profiling. btw, can I also profile omni model of stage1 or stage 2 using profiler-config?

Comment thread docs/contributing/profiling.md
@hsliuustc0106 hsliuustc0106 merged commit eaef3a3 into vllm-project:main Feb 27, 2026
7 checks passed
@gxxx-hum
Copy link
Copy Markdown
Contributor

The final generated trace file appears to lack GPU-based data collection information, showing only CPU-side data.
image

@gcanlin
Copy link
Copy Markdown
Collaborator Author

gcanlin commented Feb 28, 2026

The final generated trace file appears to lack GPU-based data collection information, showing only CPU-side data.

Are you profiling on NPU? For diffusion models, you could use #1542 temporarily.

@gxxx-hum
Copy link
Copy Markdown
Contributor

Are you profiling on NPU? For diffusion models, you could use #1542 temporarily.

A800,not NPU

@gcanlin
Copy link
Copy Markdown
Collaborator Author

gcanlin commented Feb 28, 2026

@dazedsanjin Could you try #1261 with main branch? I have also updated docs in the PR. If the problem still exists, feel free to open an issue or let me know. Thanks!

@gxxx-hum
Copy link
Copy Markdown
Contributor

feel free to open an issue or let me know. Thanks!

I'll try it

with1015 added a commit to with1015/vllm-omni that referenced this pull request Apr 6, 2026
* [Frontend][Model] Support batch request with refined OmniDiffusionReq… (#797)

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

* [Model]: add FLUX.1-dev model (#853)

* [BugFix] ignore mm data from stages to async omni (#954)

Signed-off-by: dengyunyang <584797741@qq.com>

* Revert "[BugFix] ignore mm data from stages to async omni" (#1023)

* [Bugfix] Modify output to model_runner_output (#1026)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [Feature] Support cache-dit for Wan 2.2 inference (#1021)

Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: Samit <285365963@qq.com>

* [Doc]Format profiling doc (#993)

Signed-off-by: lishunyang <lishunyang12@163.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [Hardware] Support platforms and plugin system (#774)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [Core]: KV Cache Transfer Encapsulation (#979)

Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>

* [Test]Delete skip mark for amd ci test and fix CI failure (#927)

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Bugfix][Doc]Specify Qwen3-TTS model name for each task type (#1036)

Signed-off-by: Kyle Huang <yellowsea@gmail.com>

* [Misc] pin version of fa3-fwd (#1051)

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

* [CI] [ROCm] Add more AMD CI tests (#1039)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

* [Bugfix] fix qwen image layerd in dummy run (#1027)

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

* [BugFix] Fix noisy output without setting a seed in Qwen Image (#1043)

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

* [bugfix] remove vllm speech route (#1060)

Signed-off-by: linyueqian <linyueqian@outlook.com>

* [Debug] Update GLM-Image Pipeline (#1049)

Co-authored-by: root <root@hk01dgx028.cm.cluster>

* [Diffusion][Bugfix] Fix the flash_attn backends selection logic (#983)

Signed-off-by: mxuax <mxuax@connect.ust.hk>
Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [BugFix] Fix the accuracy issue of multimodal input. (#1020)

Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
Co-authored-by: Rein Yang <ruiruyang2@gmail.com>

* [Bugfix] Set VaeImageProcessor `do_convert_rgb` True (#1032)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [feat]: adapt batch request for flux (#1028)

Signed-off-by: wuzhongjian wuzhongjian_yewu@cmss.chinamobile.com

* [CI] Change Qwen3 Omni stage placement strategy  (#1072)

Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>

* [BugFix] Fix to use correct attn backend (#1038)

Signed-off-by: Divyansh Singhvi <divyanshsinghvi@gmail.com>

* [Perf] Qwen3 Omni talker mtp optimization (#1005)

Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Wan2.2] Optimize memory usage with conditional transformer loading (#980)

Signed-off-by: Lin, Fanli <fanli.lin@intel.com>
Signed-off-by: Samit <285365963@qq.com>
Co-authored-by: Samit <285365963@qq.com>

* [Feat] Support XPU Backend in vLLM-Omni (#191)

Signed-off-by: Fanli Lin <fanli.lin@intel.com>
Signed-off-by: Fanli Lin <fanli0116@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [Fix] stabilize diffusion images LoRA E2E across CI drift (#1075)

Signed-off-by: dongbo910220 <1275604947@qq.com>

* [Bugfix][Test] Re-enable the log simple tests (#1065)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [Bugfix] pr conflict fix, bugfix ignore mm data from stages to async omni (#1025)

Signed-off-by: dengyunyang <584797741@qq.com>

* [Doc][Bagel] Add BAGEL-7B-MoT documentation and edit the default stage configuration (#987)

Signed-off-by: Ding Zuhao <e1583181@u.nus.edu>
Signed-off-by: jzz <e1583181@u.nus.edu>

* [Fix] Increase max wait time for server readiness to accommodate model loading (#1089)

Signed-off-by: Andy Zhou <46011930+AndyZhou952@users.noreply.github.com>

* [Benchmark] Add vLLM-Omni Omni model online benchmark (#780)

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Bugfix] Remove Mooncake/Yuanrong connector import warning (#1091)

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

* fix: UnboundLocalError for role in streaming audio/image responses (#784)

Signed-off-by: Pierre Le Guen <26087574+PierreLeGuen@users.noreply.github.com>

* [Misc] update wechat image (#1096)

* [Feature] Support DiT Layerwise (Blockwise) CPU Offloading (#858)

Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [BugFix] Modify max_tokens and modify the log and fix #1103 (#1097)

Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [BugFix] Fix modulate_index shape error in Qwen-Image-Edit Task (#1100)

Signed-off-by: mxuax <mxuax@connect.ust.hk>
Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Platform] Add supports_torch_inductor interface (#1108)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [BugFix] Fix Qwen3 Omni talker mtp torch.compile startup error (#1104)

Signed-off-by: ram16g <anlianfengjie@163.com>
Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>
Co-authored-by: ram16g <anlianfengjie@163.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Bugfix] fix request_id of image generation in api server (#1112)

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Perf]: CFG parallel abstraction (#851)

Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [BugFix] Fix Qwen3 TTS 0.6B profile run hang (#995) (#1082)

* [CI] [ROCm] Quick fix amd ci (#1116)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

* [Bugfix] fix benchmark audio timing error and add benchmark test (#1109)

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Bugfix][Qwen3TTS] Load speaker_id/voices from model configuration (#1079)

Signed-off-by: pablo <juanz9312@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>

* [NPU] Align with GPUModelRunner (#1114)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [FEATURE] /v1/images/edit interface (#1101)

Signed-off-by: dengyunyang <584797741@qq.com>

* [Bugfix] Fix NPU SDPA attention mask shape and semantics (#1031)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: muziyuhui666 <111362884+muziyuhui666@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [TeaCache]: Add Coefficient Estimation (#940)

Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [CI]: Bagel E2E Smoked Test (#1074)

Signed-off-by: princepride <wangzhipeng628@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Misc] Bump version to 0.14.0 (#1128)

Signed-off-by: Roger Wang <hey@rogerw.io>

* [Doc] First stable release of vLLM-Omni (#1129)

Signed-off-by: Roger Wang <hey@rogerw.io>

* [Misc] Align error handling with upstream vLLM v0.14.0 (#1122)

Signed-off-by: anna <lee.anna@navercorp.com>
Co-authored-by: anna <lee.anna@navercorp.com>

* [Feature] add Tensor Parallelism to LongCat-Image(-Edit) (#926)

Signed-off-by: Rustam Khadipash <16683750+hadipash@users.noreply.github.com>

* [CI] Temporarily remove slow tests. (#1143)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Co-authored-by: princepride <wangzhipeng628@gmail.com>

* [CI] Refactor test_sequence_parallel.py and add a warmup run for more accurate performance stat (#1165)

Signed-off-by: mxuax <mxuax@connect.ust.hk>
Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* Dev/rebase v0.15.0 (#1159)

Signed-off-by: Taichang Zhou <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>

* Docs update paper link (#1169)

Signed-off-by: hsliu <liuhongsheng4@huawei.com>
Signed-off-by: hsliu_ustc <hsliu_ustc@noreply.gitcode.com>
Co-authored-by: hsliu_ustc <hsliu_ustc@noreply.gitcode.com>

* [Debug] Clear Dockerfile.ci to accelerate build image (#1172)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Debug] Correct Unreasonable Long Timeout (#1175)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Doc]Fix - Align with repo. (#1176)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>

* [Bugfix][Qwen-Image-Edit] Add a warning log for none negative_prompt (#1170)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [Bugfix] fix qwen image oom (#1168)

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

* [Hardware] Disable compile of diffusion on XPU (#1148)

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

* [Doc] Fix vLLM version in user docs (#1179)

Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>

* [Refactor] Refactor async chunk and fix the shape mismatch issue (#1151)

Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>

* bugfix: /images/edits endpoint fails pipeline data format check (#1141)

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Perf] resolving prolonged `cudastreamsynchronize` execution in z image processing (#1105)

Signed-off-by: erfgss <97771661+erfgss@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [Bugfix] modify RTF use audio_e2e/audio_duration (#1157)

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>

* [Doc] Highlight paper & slides. (#1186)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>

* [chore] Remove zmq context initialize (#1187)

Signed-off-by: xiedeyantu <czjourney@163.com>

* [NPU] Update Dockerfile and docs for v0.14.0 (#671)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [Bugfix] E2E metric incorrect qwen3-omni with async chunk feature (#1018)

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <ljh_lbj@163.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Doc] opt doc (#1118)

Signed-off-by: David Chen <530634352@qq.com>

* [Bugfix] Fix tp+sp accuracy, incorrect process group mapping (#1178)

Signed-off-by: David Chen <530634352@qq.com>

* [Feature] Enable use_audio_in_video for Qwen 3 Omni Online (#1198)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Bugfix] async_chunk rebase v0.15.0 (#1195)

Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>

* [feature]: support flux cache_dit (#1145)

Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>

* [CI] Add CI branch coverage calculation,  fix statement coverage results and add log before test for buildkite  log group (#1120)

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>

* [Wan 2.2][Diffusion] Add TP Support (#964)

Signed-off-by: weichen <calvin_zhu0210@outlook.com>

* [Hardware] [Feat] Setup platform dependent package installation (#1046)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: PopSoda2002 <zhouhp.me@gmail.com>
Co-authored-by: gcanlin <canlinguosdu@gmail.com>

* [XPU] Fix XPU UTs for basic coverage (#1164)

Signed-off-by: Yan Ma <yan.ma@intel.com>

* [Test] Add BuildKite test-full script for full CI. (#867)

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>

* [Refactor] Reuse upstream Qwen3MoeSparseMoeBlock (#1202)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [Bugfix] Fix wan2.2 ti2v (#1221)

Signed-off-by: mxuax <mxuax@connect.ust.hk>
Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Bugfix] Fix '--max-generated-image-size' cli args type (#1249)

Signed-off-by: ApsarasX <apsarax@outlook.com>

* [Bugfix] Ensure seed=0 is correctly handled in image edit (#1248)

Signed-off-by: ApsarasX <apsarax@outlook.com>

* [Docs] Add example image download step to Image-To-Video examples (#1258)

Signed-off-by: lishunyang <lishunyang12@163.com>

* [Bugfix] Fix padding bug in 12Hz tokenizer ConvTranspose1d decode (#1241)

Signed-off-by: linyueqian <linyueqian@outlook.com>

* [bugfix] Fix multimodal_output property to check completion outputs where audio data is attached (#1203)

Signed-off-by: linyueqian <linyueqian@outlook.com>

* [Doc] Update QA relevant to quantization  (#1257)

Signed-off-by: lishunyang <lishunyang12@163.com>

* [Bugfix] Fix Doc link Rrror (#1263)

Signed-off-by: lishunyang <lishunyang12@163.com>

* Process-Scoped GPU Memory Accounting (#1204)

Signed-off-by: Divyansh Singhvi <divyanshsinghvi@gmail.com>

* [ComfyUI]: ComfyUI integration (#1113)

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

* fix: add diffusion offload args to OmniConfig group instead of serve_parser (#1271)

Signed-off-by: Chenguang ZHENG <645327136@qq.com>

* [Doc] Adding models/pipelines/features Tutorial (#1196)

Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: dongbo910220 <32610838+dongbo910220@users.noreply.github.com>

* [CI] Add env variable check for nightly CI  (#1281)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>

* [CI] Add pytest markers to current tests and update the doc. (#577)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Diffusion][Perf] Remove Redundant Communication Cost by Refining SP Hook Design (#1275)

Signed-off-by: mxuax <mxuax@connect.ust.hk>
Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>

* [Feature] Opt metrics structure (#891)

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <ljh_lbj@163.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Test] Add example test cases for omni online (#1086)

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: yenuo26 <410167048@qq.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [CI] Reduce the time for Diffusion Sequence Parallelism Test (#1283)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>

* [Model] SupportHunyuanImage3 Diffusion Model in vllm-omni (#1085)

Signed-off-by: Semmer2 <semmer@live.cn>

* [Chore] Update copyright year. (#1256)

Signed-off-by: lishunyang <lishunyang12@163.com>

* [feature]: support Flux.1-dev CFG-Parallel (#1269)

* [Bugfix] Fix 'NoneType' AttributeError in stable-diffusion model detect (#1254)

Signed-off-by: Yan Ma <yan.ma@intel.com>

* [Doc] Update Qwen3-TTS docs for consistency with Omni examples (#1226)

Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Fix]Ensure HuggingFace downloads complete before initialization. (#1213)

Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [BugFix] Fixed the issue where ignore_eos was not working. (#1286)

Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>

* [Test] Add e2e tests for Qwen3-TTS speech endpoint (#1206)

Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

* [Feat]: support VAE patch parallelism (#756)

Signed-off-by: dongbo910220 <1275604947@qq.com>
Co-authored-by: hsliuustc0106 <liuhongsheng4@huawei.com>

* [CI] Disable Qwen3-TTS E2E Test in pipeline.yml (#1306)

Signed-off-by: Gao Han <hgaoaf@connect.ust.hk>

* [Misc] Add per-request generator_device to online image gen and edit (#1183)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [Bagel]: Support TP (#1293)

Signed-off-by: princepride <wangzhipeng628@gmail.com>

* [Bugfix] Fix image edit RoPE crash when explicit height/width are provided (#1265)

Signed-off-by: lishunyang <lishunyang12@163.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Doc] Sync (#1216)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>

* [Bugfix] fix precision issues of qwen3-omni when enable async_chunk without system prompt (#1288)

Signed-off-by: Rein Yang <ruiruyang2@gmail.com>

* [Debug] Add trigger to concurrent stage init (#1274)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Bugfix][Qwen3-TTS] Fix task type (#1317)

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>

* Unifying CLI Argument Naming Style (#1309)

Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>

* [Bugfix][Qwen3-TTS] Preserve original model ID in omni_snapshot_download (#1318)

* [CI] Run nightly tests. (#1333)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>

* [Feature]: FP8 Quantization Support for DiT  (#1034)

Signed-off-by: lishunyang <lishunyang12@163.com>
Signed-off-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>

* Fix yield token metrics and opt metrics record stats (#1292)

* [Test] L2 & L3 Test Case Stratification Design for Omni Model (#1272)

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: yenuo26 <410167048@qq.com>
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Pref] Support Qwen3 Omni code2wav batch infernce with async chunk (#1246)

Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: Ziming Huang <1520787127@qq.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update qwen3-omni & qwen2.5-onmi openai client (#1304)

Signed-off-by: Rein Yang <ruiruyang2@gmail.com>

* [Feature] Support Wan2.2 T2V and I2V Online Serving with OpenAI /v1/videos API (#1073)

Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: Samit <285365963@qq.com>
Signed-off-by: SamitHuang <285365963@qq.com>
Co-authored-by: Flora Feng <4florafeng@gmail.com>

* [Feature] add Tensor Parallelism to SD_3.5 (#1336)

Signed-off-by: GG-li <3226868735@qq.com>

* [Feature]async scheduling to overlap chunk IO and compute (#951)

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
Co-authored-by: Bhanu068 <voutharoja.bhanu06@gmail.com>
Co-authored-by: Gao Han <gaohan19@huawei.com>

* [Bugfix] reused metrics to modify the API Server token statistics in Stream Response (#1301)

Signed-off-by: John Liu BUAA <liukecheng97@gmail.com>

* Refactor CPU Offloading Backend Pattern (#1223)

Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>
Signed-off-by: Samit <285365963@qq.com>
Co-authored-by: Samit <285365963@qq.com>

* [DOC] Doc for CI test - Details about five level stucture and some other files. (#1167)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>
Co-authored-by: yenuo26 <410167048@qq.com>

* [Bugfix] remove Tongyi-MAI/Z-Image-Turbo related test from L2 ci (#1348)

Signed-off-by: dengyunyang <584797741@qq.com>

* [Misc] wechat image update (#1354)

Signed-off-by: David Chen <530634352@qq.com>

* [Misc] Support WorkerWrapperBase and CustomPipeline for Diffusion Worker (#764)

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

* [Feature][Bugfix] Add CFG feature to Bagel (#1310)

Signed-off-by: Ding Zuhao <e1583181@u.nus.edu>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>

* [Feature]: Diffusion sleep to use process level memory calculation (#1276)

Signed-off-by: Divyansh Singhvi <divyanshsinghvi@gmail.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Signed-off-by: dsinghvi <divyanshsinghvi@gmail.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>

* change qwen3-omni open cudagraph by default (#1352)

Signed-off-by: Rein Yang <ruiruyang2@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [XPU] Update Bagel's flash_attn_varlen_func to fa utils (#1295)

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

* [Test] Add Omni Model Performance Benchmark Test (#1321)

Signed-off-by: yenuo26 <410167048@qq.com>
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>

* [BugFix]: Revert utils change (#1369)

Signed-off-by: princepride <wangzhipeng628@gmail.com>

* [Rebase] Rebase to vllm v0.16.0 (#1357)

Signed-off-by: Taichang Zhou <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Isotr0py <Isotr0py@outlook.com>
Co-authored-by: ZJY0516 <zhu.jiangyun@foxmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>

* [Test] Fix expansion and example test case for qwen3-omni (#1358)

Signed-off-by: yenuo26 <410167048@qq.com>

* [v0.16.0][BUG FIX]Fix hunyuan MOE after update to 0.16.0 (#1401)

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [0.16.0] remove cuda hard-code for Hunyuan Image3 (#1402)

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [XPU] Add XPU Dockerfile and related docs (#1162)

Signed-off-by: Yan Ma <yan.ma@intel.com>
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
Co-authored-by: Daniel Huang <daniel1.huang@intel.com>

* [Bugfix] Fix Hardcoded Datatypes in Z-image (#1393)

Signed-off-by: Alex Brooks <albrooks@redhat.com>

* [Feature] : Support disaggregated inference pipeline for Qwen3_TTS (#1161)

Signed-off-by: Sy03 <1370724210@qq.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Feature] Add automated PR reviewer bot with GLM integration (#1424)

Signed-off-by: hsliu <liuhongsheng4@huawei.com>
Signed-off-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* [Misc] Add Qwen2.5-Omni-3B model support to Gradio demo (#1382)

Signed-off-by: UsamaKenway <usamakenway@gmail.com>

* [misc] Feature/pr reviewer auto trigger&update model (#1431)

Signed-off-by: hsliu <liuhongsheng4@huawei.com>
Signed-off-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Hunter Liu <hunter@liu.sh>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* Revert "[misc] Feature/pr reviewer auto trigger&update model" (#1432)

* [Doc] Update GPU installation commands (#1434)

* [ROCM] [CI] fix dockerfile.rocm to support nightly build and also fix amd ci v0.16.0rc1 (#1380)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

* [Feature][BAGEL] Combine multi-branch cfg into a single batch to accelerate inference. (#1429)

Signed-off-by: Ding Zuhao <e1583181@u.nus.edu>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>

* [Feat]: add ASCII art logo for vLLM-Omni  (#1430)

* [Bug] [Bagel] Fix kv transfer bug (#1437)

Signed-off-by: Ding Zuhao <e1583181@u.nus.edu>
Co-authored-by: Wang Zhipeng: princepride <wangzhipeng628@gmail.com>

* [CI] Set L2 & L3 tests running conditions. (#1344)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>

* [Feature] vLLM-Omni RDMA connector (#1019)

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

* [Minor][Refactor] Pass seq_token_counts explicitly (#1425)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Misc] Extend Diffusion Benchmark script to other backends (#875)

Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Feature] Support Stage Based Deployment CLI (#939)

Signed-off-by: wuhang <wuhang6@huawei.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: wuhang <whlbx@hotmail.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [Doc] Optimize vLLM-Omni metrics documentation (#1311)

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <ljh_lbj@163.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Bugfix]  Forward all vllm-omni serve command parameters to model (#985)

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <ljh_lbj@163.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Doc]: Add bagel single/multi node usage with mooncake document (#1450)

* [Qwen3TTS][Feat] Code2Wav batched decoding (#1426)

Signed-off-by: pablo <pablo@agigo.ai>
Co-authored-by: pablo <pablo@agigo.ai>

* [CI] Remove overwhelming debug log (#1463)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Misc] update wechat image (#1464)

Signed-off-by: David Chen <530634352@qq.com>

* [Doc] Refine Diffusion Tutorial Documents (#1305)

Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>

* [Bugfix] Robust Audio Data Handling in _create_audio_choice (#1222)

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>

* [Bugfix]: Fix merging updated additional information to ensure dict type (#1296)

Signed-off-by: Shijin Zhang <75300765+Dovis01@users.noreply.github.com>

* [Model]Add new nextstep_1(Diffusion) model(only T2I) (#612)

Signed-off-by: Dong Wang <dongw2019@gmail.com>
Signed-off-by: sniper35 <dongw2019@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Bugfix] Add TTS configuration options (#1177)

Signed-off-by: Yanick Schraner <yanick.schraner@bs.ch>

* [Debug] Multi-Request for Qwen 3 Omni use_audio_in_video (#1433)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Bugfix] Fix case-sensitive task_type matching in Qwen3TTSModelForGeneration (#1455)

Signed-off-by: Sangchun Ha <seomk9896@gmail.com>

* [BugFix] process request.num_cached_tokens if it equals to the initial value  (#1468)

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Co-authored-by: Gao Han <gaohan19@huawei.com>

* [Bugfix] Fix SDPA attention mask dtype and shape (Fix #857) (#1349)

Signed-off-by: jader <yjader@foxmail.com>

* [Test] Reduce Perf test case and fix modify stage config (#1449)

Signed-off-by: yenuo26 <410167048@qq.com>

* [NPU] Upgrade to v0.16.0 (#1375)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [CI] Update Dockerfile for vllm-omni CI image and remove obsolete dep… (#1491)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Fix][Chore] Qwen3-TTS Modeling Minor Code Sanity Improvements (#1482)

Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>

* [Bugfix] Fix tuple/list KV cache extraction crash (#1405)

Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Doc] format lora related docs for the user's end (#1009)

Signed-off-by: AndyZhou952 <jzhoubc@connect.ust.hk>
Signed-off-by: Andy Zhou <46011930+AndyZhou952@users.noreply.github.com>

* [Feature] Support Wan2.2 output with irregular shapes (#1279)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [Misc] Migrate L1 tests to use pytest-mock (#1315)

Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>
Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>

* [Bugfix] Fix LoRA Scaling on Active Adapters (#1421)

Signed-off-by: Alex Brooks <albrooks@redhat.com>

* [Bugfix] fix record audio generated frame in offline infer (#1312)

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <ljh_lbj@163.com>

* [Model] Support OmniGen2 (#513)

Signed-off-by: Yupu <feng.yu.pu0330@gmail.com>

* [Bugfix][Qwen3TTS] (#1289)

Signed-off-by: pablo <juanz9312@gmail.com>
Co-authored-by: Gao Han <gaohan19@huawei.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* Use pull through cache image for H100 pool (#1518)

Signed-off-by: Kevin H. Luu <khluu000@gmail.com>

* [ROCm] [CI] [Docker] Point to use the latest vLLM v0.16.0 stable version (#1500)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

* [Bugfix] fix offline text_to_image error from #1009 (#1515)

Signed-off-by: David Chen <530634352@qq.com>

* [XPU] Enable FLASH_ATTN on XPU (#1332)

Signed-off-by: Yan Ma <yan.ma@intel.com>

* Revert gpu_1 job to use regular image (#1521)

Signed-off-by: Kevin H. Luu <khluu000@gmail.com>

* [Chore] remove unused logger in omni_diffusion (#531) (#1509)

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Co-authored-by: Gao Han <gaohan19@huawei.com>

* [Qwen3TTS][Feat] Streaming output (#1438)

Signed-off-by: pablo <pablo@agigo.ai>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: pablo <pablo@agigo.ai>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Bugfix] Race condition in MultiprocExecutor when concurent access to Scheduler (#1448)

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Doc][Test][Misc] ComfyUI test, more screenshot, and code cleaning (#1435)

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Signed-off-by: Samit <285365963@qq.com>
Co-authored-by: Samit <285365963@qq.com>

* [Performance]Qwen3-Omni performance optimization (#1378)

Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>

* [Feature] Support HSDP for diffusion models (#1339)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [CI] fixed CI timeout (#1460)

Signed-off-by: zhumingjue <zhumingjue@huawei.com>
Signed-off-by: zhumingjue138 <zhumingjue@huawei.com>

* [Bugfix] Use uds for zmq address if not set --stage-id (#1522)

Signed-off-by: wuhang <wuhang6@huawei.com>

* [BugFix] Restore talker's config (#1524)

Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Canlin Guo <961750412@qq.com>

* [XPU] fix qwen_omni after rebase to v0.16.0 (#1416)

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Platform] Enable layerwise offload on all hardware (#1492)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* diffusion: enable VAE patch parallel for SD3.5 (#1428)

Signed-off-by: dongbo910220 <1275604947@qq.com>

* [Perf] GLM Image (#920)

Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: Jared Wen <w13431838023@gmail.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [skip ci][Doc] add design docs for async chunk in qwen3-omni (#962)

Signed-off-by: Rein Yang <ruiruyang2@gmail.com>

* feat(qwen3-tts): Add CUDA Graph support for speech tokenizer decoder (#1205)

Signed-off-by: xulusjb <fdukeshik@gmail.com>
Co-authored-by: xulusjb <fdukeshik@gmail.com>

* [New Model]: XiaomiMiMo/MiMo-Audio-7B-Instruct support (#750)

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: 齐保元 <qibaoyuan@xiaomi.com>
Signed-off-by: hsliu <liuhongsheng4@huawei.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: GG-li <3226868735@qq.com>
Signed-off-by: Sihao Li <111170255+GG-li@users.noreply.github.com>
Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Signed-off-by: mxuax <mxuax@connect.ust.hk>
Signed-off-by: Baoyuan Qi <qibaoyuan@126.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
Signed-off-by: dongbo910220 <1275604947@qq.com>
Signed-off-by: dongbo910220 <32610838+dongbo910220@users.noreply.github.com>
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: baoyuan qi <qibaoyuan@126.com>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: Prajwal A <prajwalanagani@gmail.com>
Signed-off-by: Shijin Zhang <75300765+Dovis01@users.noreply.github.com>
Signed-off-by: 丁宁 <nndding@gmail.com>
Signed-off-by: SHIJIN ZHANG <75300765+Dovis01@users.noreply.github.com>
Signed-off-by: dingning<dingning7@xiaomi.com>
Signed-off-by: dingning <dingning7@xiaomi.com>
Signed-off-by: dingning <dingning@xiaomi.com>
Co-authored-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: Zhang Shijin <zhangshijin@xiaomi.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Sihao Li <111170255+GG-li@users.noreply.github.com>
Co-authored-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Co-authored-by: Canlin Guo <canlinguosdu@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: JohnJan <wuzhongjian_yewu@cmss.chinamobile.com>
Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
Co-authored-by: dongbo910220 <32610838+dongbo910220@users.noreply.github.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: Junhong Liu <ljh_lbj@163.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: shijin zhang <zsj1364226740@gmail.com>
Co-authored-by: Zhou Taichang <tzhouam@connect.ust.hk>
Co-authored-by: root <root@hk01dgx028.cm.cluster>
Co-authored-by: Prajwal A <34590600+LawJarp-A@users.noreply.github.com>
Co-authored-by: Shijin Zhang <75300765+Dovis01@users.noreply.github.com>
Co-authored-by: dingning <dingning7@xiaomi.com>
Co-authored-by: ning ding <nndding@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [Feature]: Native GGUF Quantization Support for DiT (#1285)

Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* Add benchmark for `v1/audio/speech` non-streaming (#1408)

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

* [Version] Auto generate version using `setuptool_scm` (#1224)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

* [Feat] : Support Async chunk cleanup (#1087)

Signed-off-by: Sy03 <1370724210@qq.com>

* [Profiler] Support online profiling (#1136)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: Canlin Guo <961750412@qq.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>

* [Bugfix] Fix redundant finished req status updating on OmniGenerationScheduler (#1510)

Signed-off-by: shijin zhang <75300765+Dovis01@users.noreply.github.com>
Co-authored-by: 齐保元 <qibaoyuan@xiaomi.com>

* [XPU][NPU][ROCM] enable cpu_offloading flag for non_cuda (#1488)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Co-authored-by: gcanlin <canlinguosdu@gmail.com>

* [Chore] Cleanup dead code in GGUF DiT code path (#1533)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

* [Doc] Update installation instructions for vllm 0.16.0 (#1505)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Doc] [skip ci]Sync. (#1363)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>
Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

* [CI][skip ci]Update H100 image link based on #1518 (#1538)

Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>

* Fix no embed text spk tokens (#1540)

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>

* [Debug] Merge vllm pull 35368 (#1534)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Docs] update async chunk docs diagram [skip ci] (#1530)

Signed-off-by: Rein Yang <ruiruyang2@gmail.com>

* fix(qwen3-tts): fix Base ICL voice clone producing corrupted audio (#1554)

Signed-off-by: linyueqian <linyueqian@outlook.com>

* [NPU][Bugfix] Align GPU side and recover qwen3-tts (#1564)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

* [BugFix] Fix unexpected crash when init OmniDiffusion (#1562)

Signed-off-by: Semmer2 <semmer@live.cn>

* [CI] Modify some CI test cases to run on L4 environment to reduce H100 resource usage. (#1543)

Signed-off-by: yenuo26 <410167048@qq.com>
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>

* [BugFix]: fix a lot of bug (#1565)

Signed-off-by: princepride <wangzhipeng628@gmail.com>

* feat: add HyperCLOVAX-SEED-Omni-8B support

Model files:
- vllm_omni/diffusion/models/hyperclovax_vision/: vision decoder pipeline
  (HyperCLOVAXVisionPipeline) using flow matching diffusion + VisionTransformer
- vllm_omni/diffusion/models/hyperclovax_audio/: audio decoder pipeline
  (HyperCLOVAXAudioPipeline) using Unit-BigVGAN codec
- vllm_omni/model_executor/stage_input_processors/hyperclovax_seed_omni.py:
  thinker2vision_decoder and thinker2audio_decoder — extract discrete tokens from
  LLM output; truncate/pad vision codes to 729 (27x27) for decoder

Registry:
- vllm_omni/diffusion/registry.py: register HyperCLOVAXVisionPipeline and
  HyperCLOVAXAudioPipeline with post-process functions

Stage config:
- vllm_omni/model_executor/stage_configs/hcx_omni.yaml: 3-stage config
  Stage 0: LLM thinker (TP=4, GPUs 0-3), Stage 1: vision decoder (GPU 4),
  Stage 2: audio decoder (GPU 5)

Bug fixes for HyperCLOVAX compatibility:
- diffusion/request.py: add extra dict field to OmniDiffusionRequest so
  vision_tokens/audio_tokens from stage input processors reach the pipeline
- entrypoints/async_omni_diffusion.py: extract OmniTokensPrompt.additional_information
  into OmniDiffusionRequest.extra before creating request
- entrypoints/omni_stage.py: skip empty engine inputs (text-only requests where
  thinker2vision_decoder/thinker2audio_decoder return [])
- entrypoints/async_omni.py: handle skipped sentinel in _process_single_result
  so text-only requests complete without crashing on Stage 1/2

* fix: correct decoder params and HCX porting fixes

- hcx_omni.yaml: guidance_scale 3.5→0.75, num_inference_steps 30→50
  (matches OmniServe production defaults; 3.5 caused over-amplified
  autoguidance → shrunken/degraded output images)
- omni_stage.py: skip empty engine inputs for text-only requests
- async_omni_diffusion.py: extract OmniTokensPrompt.additional_information
  into OmniDiffusionRequest.extra (audio_tokens/vision_tokens)
- registry.py: HCX Omni diffusion model registration fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: HyperCLOVAX-SEED-Omni-8B stage pipeline and entrypoint fixes

* fix: change guidance_scale from 9.0 to 0.75 (autoguidance scale, OmniServe default)

* feat: add audio decoder Stage 2 to hcx_omni pipeline

- Wire HyperCLOVAXAudioPipeline as Stage 2 in hcx_omni.yaml
- GPU 5 assigned for audio decoder (Unit-BigVGAN / NCCosybigvganDecoder)
- Add runtime edge 0->2 (thinker -> audio decoder)
- Implement post-generation PCM chunk streaming for audio output
  (4800 samples / 200ms per SSE event @ 24kHz, int16 base64-encoded)

Refs: github.com/vllm-project/vllm-omni/pull/869 (already incorporated)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: vllm version compatibility for HyperCLOVAX audio decoder startup

- config/model.py: try/except fallback for AttentionBackendEnum import
  (vllm.v1.attention.backends.registry absent in older vllm builds)
- pipeline_hyperclovax_audio.py: return actual named_parameters() from
  load_weights() when using MAR checkpoint so diffusers_loader strict
  check passes (weights loaded eagerly in __init__ via MAR extraction)
- qwen3_omni_moe_thinker.py, qwen2_5_omni_thinker.py: try/except stubs
  for check_interleaved_audio_video and merge_interleaved_embeddings
  which are absent in older vllm qwen2_5_omni_thinker; these symbols
  are only exercised by Qwen models, not HyperCLOVAX

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: add edge 1→2 and correct model key in hcx_omni.yaml Stage 2

- Add runtime edge from:1 to:2 (required for Stage-2 connector init;
  without it AsyncOrchestrator cannot route to audio decoder at runtime)
- Change model_subdir to model for Stage-2 engine_args to match
  total-poc working reference config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: audio S2S output - handle diffusion outputs in _create_audio_choice

HyperCLOVAXAudioPipeline (diffusion) stores audio in multimodal_output
directly (OmniRequestOutput.from_diffusion), not in outputs[0].multimodal_output
like LLM pipelines. Fix three locations:

1. _create_audio_choice (non-streaming): use omni_outputs.multimodal_output
   when final_res.outputs is empty (diffusion path).
2. Streaming audio path: same fix for _final_res.outputs[0].
3. Both loops (for output in final_res.outputs): fall back to single
   synthetic choice at index 0 when outputs list is empty.
4. Handle bytes audio output from HyperCLOVAXAudioPipeline post-process
   (returns WAV bytes, not tensors like Qwen3-Omni).

Also fixes audio input (A2T) regression: skip diffusion prompt extraction
when mm_data has audio content (added in previous session).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: parse WAV bytes with soundfile for uniform PCM chunk streaming

HyperCLOVAXAudioPipeline returns WAV bytes including 44-byte header.
The previous byte-offset splitting included the header in the first
chunk, corrupting it. Fix: parse with soundfile to get float32 PCM,
then convert to int16 chunks uniformly regardless of source type
(bytes or tensor).

Verified: 136 audio chunks x 200ms = 27.04s audio streamed correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: zero-shot TTS with speaker embedding from input audio

- serving_chat.py: extract last input_audio base64 from request messages
  and inject as ref_audio_b64 into engine_prompt dict
- thinker2audio_decoder: read ref_audio_b64 from prompt and pass as
  ref_audio_tokens to Stage 2 (HyperCLOVAXAudioPipeline)
- hcx_omni.yaml: switch Stage 2 to NCZSCosybigvganDecoder.mar (zero-shot)
  which uses ECAPA-TDNN speaker encoder instead of finetuned ID lookup

Pipeline: input audio -> ECAPA-TDNN -> speaker embedding -> BigVGAN synthesis
matching the voice characteristics of the original speaker.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: wire audio decoder Stage 2 to hcx_omni pipeline and fix S2S flow

- Add Stage 2 (HyperCLOVAXAudioPipeline / NCZSCosybigvganDecoder) to hcx_omni.yaml
  with GPU 5, gpu_memory_utilization 0.4, edge 0->2 from thinker
- Fix thinker2audio_decoder: correct audio token range (128606-135167),
  remap to [0, 6561) for BigVGAN input, handle empty token case gracefully
- Fix pipeline_hyperclovax_audio.py post_process_func signature and
  incorporate PR#869 BUG FIX patches for stable audio generation

* fix: use finetuned audio decoder and fix transformers_modules deserialization

- hcx_omni.yaml: switch Stage 2 from NCZSCosybigvganDecoder (zero-shot,
  ECAPA-TDNN) to NCCosybigvganDecoder (finetuned, nn.Embedding speaker id).
  Zero-shot decoder required ref_audio (mel spectrogram) which is unavailable
  for text-only requests and incompatible with finetuned decoder path.

- pipeline_hyperclovax_audio.py: guard ref_audio processing with
  'not self.bigvgan.finetune' — finetuned decoder has no ECAPA-TDNN encoder,
  so passing ref_audio bytes would crash with 'expected 100 channels'.

- omni_stage.py: add HuggingFace modules cache (~/.cache/huggingface/modules)
  to sys.path before queue.get_nowait() in try_collect(). Stage-0 pickles
  outputs containing custom classes from transformers_modules (trust_remote_code),
  but the API server process doesn't have this path, causing deserialization
  failures that silently drop Stage-0 outputs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: restore zero-shot speaker cloning with fallback for text-only requests

- hcx_omni.yaml: revert to NCZSCosybigvganDecoder.mar (zero-shot ECAPA-TDNN)
  for voice-preserving S2S synthesis. NCCosybigvganDecoder used a fixed
  integer speaker_id and lost the input speaker's voice.

- pipeline_hyperclovax_audio.py: add zero-mel fallback branch for
  finetune=False + ref_audio=None case. When a text-only request arrives
  (no input audio → no ref_audio), ECAPA-TDNN receives a zero mel tensor
  [1, num_mels, 64] instead of crashing with 'expected 100 channels'.
  S2S requests always have ref_audio so the zero-shot cloning path is
  unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: add stage config yaml for HCX audio decoder

Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>

* feat: add HyperCLOVAX-SEED-Omni 8B model as vllm-omni executor

Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>

* feat: add HCX audio decoder pipeline

Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>

* fix: modify exception for HCX audio decoder (GAN)

Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>

* fix: default temperature set to 0, and pipeline model evaluation mode

Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>

---------

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Signed-off-by: dengyunyang <584797741@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: Samit <285365963@qq.com>
Signed-off-by: lishunyang <lishunyang12@163.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: Kyle Huang <yellowsea@gmail.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: natureofnature <wzliu@connect.hku.hk>
Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: mxuax <mxuax@connect.ust.hk>
Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
Signed-off-by: wuzhongjian wuzhongjian_yewu@cmss.chinamobile.com
Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>
Signed-off-by: Divyansh Singhvi <divyanshsinghvi@gmail.com>
Signed-off-by: Lin, Fanli <fanli.lin@intel.com>
Signed-off-by: Fanli Lin <fanli.lin@intel.com>
Signed-off-by: Fanli Lin <fanli0116@gmail.com>
Signed-off-by: dongbo910220 <1275604947@qq.com>
Signed-off-by: Ding Zuhao <e1583181@u.nus.edu>
Signed-off-by: jzz <e1583181@u.nus.edu>
Signed-off-by: Andy Zhou <46011930+AndyZhou952@users.noreply.github.com>
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Signed-off-by: Pierre Le Guen <26087574+PierreLeGuen@users.noreply.github.com>
Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>
Signed-off-by: ram16g <anlianfengjie@163.com>
Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Signed-off-by: pablo <juanz9312@gmail.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: anna <lee.anna@navercorp.com>
Signed-off-by: Rustam Khadipash <16683750+hadipash@users.noreply.github.com>
Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>
Signed-off-by: Taichang Zhou <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: hsliu <liuhongsheng4@huawei.com>
Signed-off-by: hsliu_ustc <hsliu_ustc@noreply.gitcode.com>
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>
Signed-off-by: erfgss <97771661+erfgss@users.noreply.github.com>
Signed-off-by: xiedeyantu <czjourney@163.com>
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <ljh_lbj@163.com>
Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: weichen <calvin_zhu0210@outlook.com>
Signed-off-by: Yan Ma <yan.ma@intel.com>
Signed-off-by: ApsarasX <apsarax@outlook.com>
Signed-off-by: Chenguang ZHENG <645327136@qq.com>
Signed-off-by: yenuo26 <410167048@qq.com>
Signed-off-by: Semmer2 <semmer@live.cn>
Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>
Signed-off-by: Gao Han <hgaoaf@connect.ust.hk>
Signed-off-by: Rein Yang <ruiruyang2@gmail.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>
Signed-off-by: Ziming Huang <1520787127@qq.com>
Signed-off-by: SamitHuang <285365963@qq.com>
Signed-off-by: GG-li <3226868735@qq.com>
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
Signed-off-by: John Liu BUAA <liukecheng97@gmail.com>
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Signed-off-by: dsinghvi <divyanshsinghvi@gmail.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
Signed-off-by: Alex Brooks <albrooks@redhat.com>
Signed-off-by: Sy03 <1370724210@qq.com>
Signed-off-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: UsamaKenway <usamakenway@gmail.com>
Signed-off-by: Hunter Liu <hunter@liu.sh>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: wuhang <wuhang6@huawei.com>
Signed-off-by: wuhang <whlbx@hotmail.com>
Signed-off-by: pablo <pablo@agigo.ai>
Signed-off-by: Shijin Zhang <75300765+Dovis01@users.noreply.github.com>
Signed-off-by: Dong Wang <dongw2019@gmail.com>
Signed-off-by: sniper35 <dongw2019@gmail.com>
Signed-off-by: Yanick Schraner <yanick.schraner@bs.ch>
Signed-off-by: Sangchun Ha <seomk9896@gmail.com>
Signed-off-by: jader <yjader@foxmail.com>
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
Signed-off-by: AndyZhou952 <jzhoubc@connect.ust.hk>
Signed-off-by: Yupu <feng.yu.pu0330@gmail.com>
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
Signed-off-by: zhumingjue <zhumingjue@huawei.com>
Signed-off-by: zhumingjue138 <zhumingjue@huawei.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: Jared Wen <w13431838023@gmail.com>
Signed-off-by: xulusjb <fdukeshik@gmail.com>
Signed-off-by: 齐保元 <qibaoyuan@xiaomi.com>
Signed-off-by: Sihao Li <111170255+GG-li@users.noreply.github.com>
Signed-off-by: Baoyuan Qi <qibaoyuan@126.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
Signed-off-by: dongbo910220 <32610838+dongbo910220@users.noreply.github.com>
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Signed-off-by: baoyuan qi <qibaoyuan@126.com>
Signed-off-by: Prajwal A <prajwalanagani@gmail.com>
Signed-off-by: 丁宁 <nndding@gmail.com>
Signed-off-by: SHIJIN ZHANG <75300765+Dovis01@users.noreply.github.com>
Signed-off-by: dingning<dingning7@xiaomi.com>
Signed-off-by: dingning <dingning7@xiaomi.com>
Signed-off-by: dingning <dingning@xiaomi.com>
Signed-off-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
Signed-off-by: Canlin Guo <961750412@qq.com>
Signed-off-by: shijin zhang <75300765+Dovis01@users.noreply.github.com>
Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>
Signed-off-by: Hyunjoon Jeong <with1015@unist.ac.kr>
Co-authored-by: Zeyu Huang | 黃澤宇 <11222265+fhfuih@users.noreply.github.com>
Co-authored-by: JohnJan <wuzhongjian_yewu@cmss.chinamobile.com>
Co-authored-by: dengyunyang <584797741@qq.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Canlin Guo <canlinguosdu@gmail.com>
Co-authored-by: Samit <285365963@qq.com>
Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: kYLe <yellowsea@gmail.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: NATURE <wzliu@connect.hku.hk>
Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Co-authored-by: Zhou Taichang <tzhouam@connect.ust.hk>
Co-authored-by: root <root@hk01dgx028.cm.cluster>
Co-authored-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Co-authored-by: amy-why-3459 <wuhaiyan17@huawei.com>
Co-authored-by: Rein Yang <ruiruyang2@gmail.com>
Co-authored-by: Ziming Huang <hzm414167@alibaba-inc.com>
Co-authored-by: dsinghvi <divyanshsinghvi@gmail.com>
Co-authored-by: Fanli Lin <fanli.lin@intel.com>
Co-authored-by: dongbo910220 <32610838+dongbo910220@users.noreply.github.com>
Co-authored-by: Ding Zuhao <e1583181@u.nus.edu>
Co-authored-by: Andy Zhou <46011930+AndyZhou952@users.noreply.github.com>
Co-authored-by: Pierre LE GUEN <26087574+PierreLeGuen@users.noreply.github.com>
Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Co-authored-by: ram16g <anlianfengjie@163.com>
Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Co-authored-by: Markus / Mark <46672778+marksverdhei@users.noreply.github.com>
Co-authored-by: Juan Pablo Zuluaga <46724788+JuanPZuluaga@users.noreply.github.com>
Co-authored-by: muziyuhui666 <111362884+muziyuhui666@users.noreply.github.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: ceanna93 <fairyanna@naver.com>
Co-authored-by: anna <lee.anna@navercorp.com>
Co-authored-by: Rustam Khadipash <16683750+hadipash@users.noreply.github.com>
Co-authored-by: Alicia <115451386+congw729@users.noreply.github.com>
Co-authored-by: hsliu_ustc <hsliu_ustc@noreply.gitcode.com>
Co-authored-by: liuzhenwei <zhenweiliu@habana.ai>
Co-authored-by: erfgss <97771661+erfgss@users.noreply.github.com>
Co-authored-by: Jensen <czjourney@163.com>
Co-authored-by: Junhong Liu <ljh_lbj@163.com>
Co-authored-by: weichen <calvin_zhu0210@outlook.com>
Co-authored-by: PopSoda2002 <zhouhp.me@gmail.com>
Co-authored-by: Yan Ma <yan.ma@intel.com>
Co-authored-by: ApsarasX <apsarax@outlook.com>
Co-authored-by: Chenguang Zheng <645327136@qq.com>
Co-authored-by: Jiaping Wu <53215702+ElleElleWu@users.noreply.github.com>
Co-authored-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>
Co-authored-by: Gao Han <gaohan19@huawei.com>
Co-authored-by: rein yang <73573651+R2-Y@users.noreply.github.com>
Co-authored-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Flora Feng <4florafeng@gmail.com>
Co-authored-by: Sihao Li <111170255+GG-li@users.noreply.github.com>
Co-authored-by: ChenWenjing <54166744+Shirley125@users.noreply.github.com>
Co-authored-by: Bhanu068 <voutharoja.bhanu06@gmail.com>
Co-authored-by: John Liu BUAA <liukecheng97@gmail.com>
Co-authored-by: yenuo26 <410167048@qq.com>
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Co-authored-by: liuzhenwei <zhenwei.liu@intel.com>
Co-authored-by: Isotr0py <Isotr0py@outlook.com>
Co-authored-by: ZJY0516 <zhu.jiangyun@foxmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Daniel Huang <daniel1.huang@intel.com>
Co-authored-by: Alex Brooks <albrooks@redhat.com>
Co-authored-by: Sy03 <1370724210@qq.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: UsamaKenway <56207634+UsamaKenway@users.noreply.github.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: wuhang <wuhang6@huawei.com>
Co-authored-by: pablo <pablo@agigo.ai>
Co-authored-by: SHIJIN ZHANG <75300765+Dovis01@users.noreply.github.com>
Co-authored-by: Dong W <89223086+sniper35@users.noreply.github.com>
Co-authored-by: Yanick Schraner <yanick.schraner@gmail.com>
Co-authored-by: Sangchun Ha <seomk9896@naver.com>
Co-authored-by: 亦瑾 <76905040+yJader@users.noreply.github.com>
Co-authored-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
Co-authored-by: Yupu <feng.yu.pu0330@gmail.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
Co-authored-by: zhumingjue138 <zhumingjue@huawei.com>
Co-authored-by: Canlin Guo <961750412@qq.com>
Co-authored-by: Jared Wen <w13431838023@gmail.com>
Co-authored-by: Xu Lu <572605156@qq.com>
Co-authored-by: xulusjb <fdukeshik@gmail.com>
Co-authored-by: Baoyuan Qi <qibaoyuan@xiaomi.com>
Co-authored-by: Zhang Shijin <zhangshijin@xiaomi.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: shijin zhang <zsj1364226740@gmail.com>
Co-authored-by: Prajwal A <34590600+LawJarp-A@users.noreply.github.com>
Co-authored-by: dingning <dingning7@xiaomi.com>
Co-authored-by: ning ding <nndding@gmail.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
Co-authored-by: Ting FU <futing10@huawei.com>
Co-authored-by: developer-account <irteam@vllm-omni-dev-0.vllm-omni-dev.p-nb13557.svc.cluster.local>
Co-authored-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: Canlin Guo <961750412@qq.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready label to trigger buildkite CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants