
[Feat] Add NPU Backend support for vLLM-Omni#89

Merged
Gaohan123 merged 28 commits into main from npu on Nov 30, 2025

Conversation

@gcanlin
Collaborator

@gcanlin gcanlin commented Nov 26, 2025

Co-authored-by: AndyZhou952 jzhoubc@connect.ust.hk
Co-authored-by: MengqingCao cmq0113@163.com

Purpose

Because plugin support is not yet available in vllm-omni, we are temporarily merging the NPU ModelRunner into the codebase. This implementation will be removed later and replaced with a plugin-based integration once the plugin system is supported.
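For context on how the eventual plugin-based integration could work: vLLM discovers platform backends through the vllm.platform_plugins entry-point group (the serving logs further down show the ascend -> vllm_ascend:register plugin being loaded this way). A minimal sketch of such a registration, with hypothetical package, module, and class names:

# Hypothetical sketch of a plugin-based NPU integration (not part of this PR).
# vLLM scans the "vllm.platform_plugins" entry-point group and calls each
# register() function; returning the import path of a Platform subclass
# activates that backend.

def register() -> str:
    """Entry point exposed in the package metadata, e.g. in pyproject.toml:

    [project.entry-points."vllm.platform_plugins"]
    omni_npu = "vllm_omni_npu:register"
    """
    return "vllm_omni_npu.platform.OmniNPUPlatform"  # hypothetical class path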

Test Plan

Install vllm-omni on the vllm-ascend v0.11.0rc2 image:

# Update DEVICE according to your NPUs (/dev/davinci[0-7])
export DEVICE0=/dev/davinci0
export DEVICE1=/dev/davinci1
# Update the vllm-ascend image
# Atlas A2:
# export IMAGE=quay.nju.edu.cn/ascend/vllm-ascend:v0.11.0rc2
# Atlas A3:
# export IMAGE=quay.nju.edu.cn/ascend/vllm-ascend:v0.11.0rc2-a3
export IMAGE=quay.nju.edu.cn/ascend/vllm-ascend:v0.11.0rc2
docker run --rm \
    --name vllm-omni-npu \
    --shm-size=1g \
    --device $DEVICE0 \
    --device $DEVICE1 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -p 8000:8000 \
    -it $IMAGE bash

cd /vllm-workspace
git clone -b v0.11.0rc1 https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -v -e .
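Optionally, a quick sanity check inside the container confirms that the NPU stack and the editable install are visible (a minimal sketch; torch_npu is expected to ship with the vllm-ascend image, and exact versions may differ):

# sanity_check.py -- illustrative only
import torch
import torch_npu  # noqa: F401  # provided by the vllm-ascend base image
import vllm_omni  # the editable install from the step above

print("torch:", torch.__version__)
print("NPU available:", torch.npu.is_available())
print("NPU device count:", torch.npu.device_count())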

Then use the example to test Qwen/Qwen2.5-Omni-7B on NPU:

export VLLM_WORKER_MULTIPROC_METHOD=spawn
python examples/offline_inference/qwen2_5_omni/end2end.py --output-wav output_audio \
                  --query-type text
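To confirm the example actually produced audio, a small check like the following can be run afterwards (a sketch; adjust the file name to whichever wav end2end.py wrote under output_audio/):

# check_audio.py -- illustrative only
import wave

with wave.open("output_audio/output_0.wav", "rb") as wav:  # hypothetical file name
    rate = wav.getframerate()
    duration_s = wav.getnframes() / rate

print(f"Generated audio: {duration_s:.1f} s at {rate} Hz")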

And test Qwen/Qwen-Image:

python examples/offline_inference/qwen_image/text_to_image.py \
  --prompt "a cup of coffee on the table" \
  --seed 42 \
  --cfg_scale 4.0 \
  --num_images_per_prompt 1 \
  --num_inference_steps 50 \
  --height 1024 \
  --width 1024 \
  --output outputs/coffee.png
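A quick way to verify that the run produced the requested 1024x1024 image (a sketch; assumes Pillow is available in the environment):

# check_image.py -- illustrative only
from PIL import Image

img = Image.open("outputs/coffee.png")  # path from the --output argument above
assert img.size == (1024, 1024), f"unexpected size: {img.size}"
print("Generated image:", img.size, img.mode)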

Test Result

Qwen2.5-Omni-7B outputs the audio successfully (to show it on GitHub, I had to convert the wav to mp4 using ffmpeg):

output.mp4

Qwen-Image outputs the coffee picture:

coffee

Initial performance results (possibly stale):

Device     | Thinker (avg_tokens_per_s) | Thinker (total_time_ms) | Talker (avg_tokens_per_s) | Talker (total_time_ms) | Code2wav (avg_tokens_per_s) | Code2wav (total_time_ms) | e2e_total_time_ms
NPU        | 31.49                      | 1555.87                 | 39.26                     | 22030.20               | 0.00                        | 133804.23               | 157461.91
GPU-before | 39.60                      | 1237.25                 | 52.76                     | 15162.34               | 0.0                         | 5713.04                 | 22121.32
GPU-after  | 40.27                      | 1216.89                 | 51.18                     | 15631.15               | 0.0                         | 5697.12                 | 22573.35

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.

@gcanlin gcanlin marked this pull request as ready for review November 27, 2025 15:09
@gcanlin gcanlin changed the title [WIP][Feat] Add Ascend NPU Backend Support for Qwen2.5-Omni [Feat] Add Ascend NPU support for omni model Nov 27, 2025

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment thread vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni.py Outdated
@gcanlin gcanlin changed the title [Feat] Add Ascend NPU support for omni model [Feat] Add Ascend NPU Backend support for omni model Nov 28, 2025
@gcanlin gcanlin changed the title [Feat] Add Ascend NPU Backend support for omni model [Feat] Add Ascend NPU Backend support for vLLM-Omni Nov 28, 2025
@gcanlin gcanlin changed the title [Feat] Add Ascend NPU Backend support for vLLM-Omni [Feat] Add NPU Backend support for vLLM-Omni Nov 29, 2025
Comment thread docs/getting_started/installation/npu/npu.inc.md
Comment thread docs/getting_started/installation/npu/npu.inc.md Outdated
Comment thread vllm_omni/entrypoints/omni_stage.py Outdated
Comment thread vllm_omni/entrypoints/stage_utils.py Outdated
Comment thread vllm_omni/worker/npu/npu_ar_worker.py Outdated
Comment thread vllm_omni/entrypoints/omni_llm.py Outdated
Comment thread vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni_token2wav.py Outdated
Comment thread vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni.py
@@ -0,0 +1,32 @@
from __future__ import annotations
Collaborator


We will optimize the platform plugin system next month.

Member

@Yikun Yikun left a comment


@gcanlin mind also pasting your test results based on the new rebase?

Otherwise LGTM

@Yikun
Member

Yikun commented Nov 29, 2025

@wangxiyuan It would be good to have a final look from you.

@Yikun
Member

Yikun commented Nov 29, 2025

@ywang96 And please take a look as well.

@gcanlin
Collaborator Author

gcanlin commented Nov 29, 2025

Online serving test:

(APIServer pid=1436236) INFO 11-29 16:02:28 [launcher.py:42] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1436236) INFO 11-29 16:02:28 [launcher.py:42] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1436236) INFO 11-29 16:02:28 [launcher.py:42] Route: /invocations, Methods: POST
(APIServer pid=1436236) INFO 11-29 16:02:28 [launcher.py:42] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1436236) INFO 11-29 16:02:28 [launcher.py:42] Route: /metrics, Methods: GET
(APIServer pid=1436236) INFO:     Started server process [1436236]
(APIServer pid=1436236) INFO:     Waiting for application startup.
(APIServer pid=1436236) INFO:     Application startup complete.
(APIServer pid=1436236) WARNING 11-29 16:03:25 [protocol.py:93] The following fields were present in the request but ignored: {'sampling_params_list'}
(APIServer pid=1436236) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(APIServer pid=1436236) INFO 11-29 16:03:33 [chat_utils.py:560] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
--------------------------------
[Stage-0] Received batch size=1, request_ids=chatcmpl-5b9adcce5cf54c049c54aee7d935c4d6
--------------------------------
('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
(APIServer pid=1436236) ('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
--------------------------------
[Stage-1] Received batch size=1, request_ids=chatcmpl-5b9adcce5cf54c049c54aee7d935c4d6
--------------------------------
(EngineCore_DP0 pid=1437676) /vllm-workspace/vllm-omni/vllm_omni/worker/npu/npu_model_runner.py:181: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:203.)
(EngineCore_DP0 pid=1437676)   info_dict[k] = torch.from_numpy(arr)
('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
--------------------------------
[Stage-2] Received batch size=1, request_ids=chatcmpl-5b9adcce5cf54c049c54aee7d935c4d6
--------------------------------
(EngineCore_DP0 pid=1438662) INFO:vllm_omni.model_executor.models.qwen2_5_omni.qwen2_5_omni:Currently, we do not use the chunked process, we only use the token2wav.process_chunk for the whole sequence. The stream mode will be implemented in the future.
INFO 11-29 16:04:00 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-29 16:04:00 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-29 16:04:00 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-29 16:04:00 [__init__.py:207] Platform plugin ascend is activated
INFO 11-29 16:04:04 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 11-29 16:04:05 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO:numexpr.utils:Note: detected 320 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
INFO:numexpr.utils:Note: NumExpr detected 320 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
INFO:numexpr.utils:NumExpr defaulting to 16 threads.
INFO:datasets:PyTorch version 2.7.1+cpu available.
...('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
(APIServer pid=1436236) INFO:vllm_omni.entrypoints.async_omni_llm:[Summary] {'e2e_requests': 1, 'e2e_total_time_ms': 157010.18977165222, 'e2e_sum_time_ms': 157009.96160507202, 'e2e_total_tokens': 0, 'e2e_avg_time_per_request_ms': 157009.96160507202, 'e2e_avg_tokens_per_s': 0.0, 'wall_time_ms': 157010.18977165222, 'final_stage_id': 2, 'stages': [{'stage_id': 0, 'requests': 1, 'tokens': 49, 'total_time_ms': 1536.33451461792, 'avg_time_per_request_ms': 1536.33451461792, 'avg_tokens_per_s': 31.894095676283168}, {'stage_id': 1, 'requests': 1, 'tokens': 865, 'total_time_ms': 21141.846418380737, 'avg_time_per_request_ms': 21141.846418380737, 'avg_tokens_per_s': 40.914118042592925}, {'stage_id': 2, 'requests': 1, 'tokens': 0, 'total_time_ms': 134305.8784008026, 'avg_time_per_request_ms': 134305.8784008026, 'avg_tokens_per_s': 0.0}], 'transfers': [{'from_stage': 0, 'to_stage': 1, 'samples': 1, 'total_bytes': 1652666, 'total_time_ms': 2.8274059295654297, 'tx_mbps': 4676.133646657559, 'rx_samples': 1, 'rx_total_bytes': 1652666, 'rx_total_time_ms': 3.074169158935547, 'rx_mbps': 4300.780899310687, 'total_samples': 1, 'total_transfer_time_ms': 6.52003288269043, 'total_mbps': 2027.8008160204777}, {'from_stage': 1, 'to_stage': 2, 'samples': 1, 'total_bytes': 2631, 'total_time_ms': 0.29730796813964844, 'tx_mbps': 70.79527713873296, 'rx_samples': 1, 'rx_total_bytes': 2631, 'rx_total_time_ms': 0.06365776062011719, 'rx_mbps': 330.64311083146066, 'total_samples': 1, 'total_transfer_time_ms': 1.2247562408447266, 'total_mbps': 17.18546050068133}]}
(APIServer pid=1436236) INFO:     127.0.0.1:34664 - "POST /v1/chat/completions HTTP/1.1" 200 OK
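For reference, the request behind the log above can be reproduced against the OpenAI-compatible endpoint exposed on port 8000; a sketch with an illustrative payload (the model name must match the one being served):

# request_example.py -- illustrative only
import requests

payload = {
    "model": "Qwen/Qwen2.5-Omni-7B",  # must match the served model
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 64,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])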

@gcanlin
Collaborator Author

gcanlin commented Nov 29, 2025

Qwen-Image:

image

Comment thread vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni_token2wav.py Outdated
@gcanlin
Collaborator Author

gcanlin commented Nov 30, 2025

run_multiple_prompts.sh:

(EngineCore_DP0 pid=1636464) INFO:vllm_omni.model_executor.models.qwen2_5_omni.qwen2_5_omni:Currently, we do not use the chunked process, we only use the token2wav.process_chunk for the whole sequence. The stream mode will be implemented in the future.
[Stage-2] Generate done: batch=1, req_ids=[9], gen_ms=100550.2
INFO:vllm_omni.entrypoints.omni_llm:[Summary] {'e2e_requests': 10, 'e2e_total_time_ms': 1088623.161315918, 'e2e_sum_time_ms': 6923636.155605316, 'e2e_total_tokens': 0, 'e2e_avg_time_per_request_ms': 692363.6155605316, 'e2e_avg_tokens_per_s': 0.0, 'wall_time_ms': 1088623.161315918, 'final_stage_id': 2, 'stages': [{'stage_id': 0, 'requests': 10, 'tokens': 768, 'total_time_ms': 21800.796270370483, 'avg_time_per_request_ms': 2180.0796270370483, 'avg_tokens_per_s': 35.228071051872114}, {'stage_id': 1, 'requests': 10, 'tokens': 8898, 'total_time_ms': 209228.48272323608, 'avg_time_per_request_ms': 20922.84827232361, 'avg_tokens_per_s': 42.527670631584726}, {'stage_id': 2, 'requests': 10, 'tokens': 0, 'total_time_ms': 1059759.6077919006, 'avg_time_per_request_ms': 105975.96077919006, 'avg_tokens_per_s': 0.0}], 'transfers': [{'from_stage': 0, 'to_stage': 1, 'samples': 10, 'total_bytes': 20398142, 'total_time_ms': 24.391889572143555, 'tx_mbps': 6690.139175866207, 'rx_samples': 10, 'rx_total_bytes': 20398142, 'rx_total_time_ms': 23.947715759277344, 'rx_mbps': 6814.225525321015, 'total_samples': 10, 'total_transfer_time_ms': 923413.3291244507, 'total_mbps': 0.17671949370140308}, {'from_stage': 1, 'to_stage': 2, 'samples': 10, 'total_bytes': 26938, 'total_time_ms': 0.7150173187255859, 'tx_mbps': 301.39689537045683, 'rx_samples': 10, 'rx_total_bytes': 26938, 'rx_total_time_ms': 0.3275871276855469, 'rx_mbps': 657.8524666783115, 'total_samples': 10, 'total_transfer_time_ms': 4609651.284694672, 'total_mbps': 4.6750607950656354e-05}]}
Request ID: 0, Text saved to output_audio/00000.txt
Request ID: 1, Text saved to output_audio/00001.txt
Request ID: 2, Text saved to output_audio/00002.txt
Request ID: 3, Text saved to output_audio/00003.txt
Request ID: 4, Text saved to output_audio/00004.txt
Request ID: 5, Text saved to output_audio/00005.txt
Request ID: 6, Text saved to output_audio/00006.txt
Request ID: 7, Text saved to output_audio/00007.txt
Request ID: 8, Text saved to output_audio/00008.txt
Request ID: 9, Text saved to output_audio/00009.txt
Request ID: 0, Saved audio to output_audio/output_0.wav
Request ID: 1, Saved audio to output_audio/output_1.wav
Request ID: 2, Saved audio to output_audio/output_2.wav
Request ID: 3, Saved audio to output_audio/output_3.wav
Request ID: 4, Saved audio to output_audio/output_4.wav
Request ID: 5, Saved audio to output_audio/output_5.wav
Request ID: 6, Saved audio to output_audio/output_6.wav
Request ID: 7, Saved audio to output_audio/output_7.wav
Request ID: 8, Saved audio to output_audio/output_8.wav
Request ID: 9, Saved audio to output_audio/output_9.wav

from vllm.v1.core.sched.scheduler import Request, RequestStatus, SchedulerOutput, SpecDecodingStats
from vllm.v1.core.sched.utils import remove_all
from vllm.v1.engine import EngineCoreEventType, EngineCoreOutput
from vllm.v1.engine import EngineCoreEventType, EngineCoreOutput, EngineCoreOutputs
Collaborator Author


This commit also fixes a common bug on both GPU and NPU. We should import EngineCoreOutputs from vllm.v1.engine to make sure the patch works.

Collaborator Author

@gcanlin gcanlin Nov 30, 2025


It doesn't break anything on GPU. Verified cases:

  • run_multiple_prompts.sh
  • run_single_prompt.sh
  • text
  • use_audio_in_video
  • online: broken by other bugs for now.
image

@gcanlin
Copy link
Copy Markdown
Collaborator Author

gcanlin commented Nov 30, 2025

I think this PR is ready now. There is a known issue that the batch size can't be set to more than 1; we will fix it in the next PR.
As a temporary workaround, the batch size in the config file is set to 1.

  * fix the EngineCoreOutput import path
  * fix the PoolerOutput handling in NPUModelRunner
  * limit the batch size to 1 in the NPU qwen2.5-omni.yaml

Collaborator

@Gaohan123 Gaohan123 left a comment


LGTM. Thanks for the great work!

@Gaohan123 Gaohan123 merged commit ec9e040 into main Nov 30, 2025
3 checks passed
@AndyZhou952 AndyZhou952 linked an issue Dec 1, 2025 that may be closed by this pull request
30 tasks
@Gaohan123 Gaohan123 deleted the npu branch December 1, 2025 09:52
princepride pushed a commit to princepride/vllm-omni that referenced this pull request Jan 10, 2026


Development

Successfully merging this pull request may close these issues.

[RFC]: Add NPU Backend Support for vLLM-Omni (Qwen2.5-Omni)

7 participants