[Feat] Add NPU Backend support for vLLM-Omni #89
Merged
Commits
28 commits
- 8508c1b [Feat] Add Ascend NPU Backend Support for Qwen2.5-Omni (gcanlin)
- ca1b48b fix code-check (gcanlin)
- ffa396f fix some bugs (gcanlin)
- 3e95be0 add the mrope support (gcanlin)
- ac0e7e2 clean code (gcanlin)
- bd4104d revert omni_llm.py changes (gcanlin)
- 5643b44 revert qwen2_5_omni.py download changes (gcanlin)
- ec7496d fix lint (gcanlin)
- d53a06e fix lint (gcanlin)
- 2969188 fix bugs (gcanlin)
- b271fe6 add docs for npu (gcanlin)
- d37bfdd fix bugs (gcanlin)
- 76a2745 fix lint (gcanlin)
- b53c41a refactor the way to select device (gcanlin)
- fecc392 fix lint (gcanlin)
- 754cab0 Make OmniNPUModelRunner cleaner (gcanlin)
- b6271c6 Separate NPU adaptation into a clean branch (gcanlin)
- 073ebfe Update docs (gcanlin)
- 31aa44c Update comments (gcanlin)
- 7ca7a3e revert unrelated changes (gcanlin)
- 49b6745 Update comments (gcanlin)
- 6d22a73 Update docs (gcanlin)
- b4c6da9 Update docs (gcanlin)
- a23a32a fix lint (gcanlin)
- fd8f2ac Support NPU backend for Diffusion (gcanlin)
- 0e9b7d2 Some fixes (gcanlin)
- cff11c0 optimize token2wav (gcanlin)
- cb5eee4 fix lint (gcanlin)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,3 +1,23 @@ | ||
| # Ascend-NPU | ||
| # NPU | ||
|
|
||
| vLLM-Omni is a Python library that supports the following NPU variants. Select your NPU type to see vendor specific instructions: | ||
| vLLM-Omni supports NPU through the vLLM Ascend Plugin (vllm-ascend), a community-maintained hardware plugin for running vLLM on NPU. | ||
|
|
||
| ## Requirements | ||
|
|
||
| - OS: Linux | ||
| - Python: 3.12 | ||
|
|
||
| !!! note | ||
| vLLM-Omni is currently not natively supported on Windows. | ||
|
|
||
| === "NPU" | ||
|
|
||
| --8<-- "docs/getting_started/installation/npu/npu.inc.md:requirements" | ||
|
|
||
| ## Installation | ||
|
|
||
| ### Recommended | ||
|
|
||
| === "NPU" | ||
|
|
||
| --8<-- "docs/getting_started/installation/npu/npu.inc.md:installation" |
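
The requirements listed in this doc change (Linux, Python 3.12) can be checked before installing. The sketch below is illustrative only and is not part of the PR; `check_requirements` is a hypothetical helper name, and it treats "Python: 3.12" as an exact minor-version match, which is an assumption.

```python
import sys


def check_requirements(
    platform: str = sys.platform,
    version: tuple[int, int] = sys.version_info[:2],
) -> list[str]:
    """Return a list of problems; an empty list means the host matches the doc."""
    problems = []
    # The doc lists Linux as the only supported OS.
    if not platform.startswith("linux"):
        problems.append(f"unsupported OS: {platform} (Linux required)")
    # Assumption: "Python: 3.12" means exactly the 3.12 minor version.
    if tuple(version) != (3, 12):
        problems.append(
            f"unsupported Python: {version[0]}.{version[1]} (3.12 expected)"
        )
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(problem)
```

Passing the platform and version as parameters keeps the check easy to exercise on a machine that does not itself meet the requirements.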
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,5 +1,48 @@ | ||
| # --8<-- [start:requirements] | ||
|
|
||
| For detailed hardware and software requirements, please refer to the [vllm-ascend installation documentation](https://docs.vllm.ai/projects/ascend/en/latest/installation.html). | ||
|
|
||
| # --8<-- [end:requirements] | ||
| # --8<-- [start:installation] | ||
|
|
||
| vLLM-Omni mainly contains python implementations for framework and models. | ||
| The recommended way to use vLLM-Omni on NPU is through the vllm-ascend pre-built Docker images: | ||
|
|
||
| ```bash | ||
| # Update DEVICE according to your NPUs (/dev/davinci[0-7]) | ||
| export DEVICE0=/dev/davinci0 | ||
| export DEVICE1=/dev/davinci1 | ||
| # Update the vllm-ascend image | ||
| # Atlas A2: | ||
| # export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0rc2 | ||
| # Atlas A3: | ||
| # export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0rc2-a3 | ||
| export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0rc2 | ||
| docker run --rm \ | ||
| --name vllm-omni-npu \ | ||
| --device $DEVICE0 \ | ||
| --device $DEVICE1 \ | ||
| --device /dev/davinci_manager \ | ||
| --device /dev/devmm_svm \ | ||
| --device /dev/hisi_hdc \ | ||
| -v /usr/local/dcmi:/usr/local/dcmi \ | ||
| -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ | ||
| -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \ | ||
| -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \ | ||
| -v /etc/ascend_install.info:/etc/ascend_install.info \ | ||
| -v /root/.cache:/root/.cache \ | ||
| -p 8000:8000 \ | ||
| -it $IMAGE bash | ||
|
|
||
| # Inside the container, install vLLM-Omni from source | ||
| cd /vllm-workspace | ||
| git clone https://github.com/vllm-project/vllm-omni.git | ||
| cd vllm-omni | ||
| pip install -v -e . | ||
| export VLLM_WORKER_MULTIPROC_METHOD=spawn | ||
| ``` | ||
|
|
||
| The default working directory is `/workspace`. The vLLM, vLLM-Ascend and vLLM-Omni code is placed in `/vllm-workspace` and installed in development mode. | ||
|
|
||
| For other installation methods (pip installation, building from source, custom Docker builds), please refer to the [vllm-ascend installation guide](https://docs.vllm.ai/projects/ascend/en/latest/installation.html). | ||
|
|
||
| # --8<-- [end:installation] | ||
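
The `docker run` command in the doc above hard-codes two NPUs through `DEVICE0` and `DEVICE1`. A minimal sketch of how the `--device` flags could be generated for whatever `/dev/davinci[0-7]` nodes exist on the host follows. Both function names are hypothetical and not part of the PR; the three fixed device paths are copied from the command above.

```python
import glob
import re


def davinci_devices(dev_dir: str = "/dev") -> list[str]:
    """List NPU device nodes such as /dev/davinci0, sorted by numeric index."""
    pattern = re.compile(r"davinci(\d+)$")
    found = []
    for path in glob.glob(f"{dev_dir}/davinci*"):
        match = pattern.search(path)
        if match:  # skips non-numbered nodes such as davinci_manager
            found.append((int(match.group(1)), path))
    return [path for _, path in sorted(found)]


def docker_device_flags(devices: list[str]) -> list[str]:
    """Build the --device arguments: one per NPU, plus the shared Ascend nodes."""
    flags = []
    for dev in devices:
        flags += ["--device", dev]
    # Shared device nodes, as listed in the docker run command above.
    for dev in ("/dev/davinci_manager", "/dev/devmm_svm", "/dev/hisi_hdc"):
        flags += ["--device", dev]
    return flags


if __name__ == "__main__":
    print(" ".join(docker_device_flags(davinci_devices())))
```

On a host without NPUs, `davinci_devices()` returns an empty list and only the three shared device flags are produced.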
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -4,9 +4,9 @@ | |
|
|
||
| from vllm.v1.core.kv_cache_manager import KVCacheBlocks | ||
| from vllm.v1.core.sched.request_queue import create_request_queue | ||
| from vllm.v1.core.sched.scheduler import EngineCoreOutputs, Request, RequestStatus, SchedulerOutput, SpecDecodingStats | ||
| from vllm.v1.core.sched.scheduler import Request, RequestStatus, SchedulerOutput, SpecDecodingStats | ||
| from vllm.v1.core.sched.utils import remove_all | ||
| from vllm.v1.engine import EngineCoreEventType, EngineCoreOutput | ||
| from vllm.v1.engine import EngineCoreEventType, EngineCoreOutput, EngineCoreOutputs | ||
|
Review comment (Collaborator, Author): This commit also fixes a bug that affects both GPU and NPU. We should import
|
||
|
|
||
| from vllm_omni.core.sched.output import OmniNewRequestData | ||
| from vllm_omni.core.sched.scheduler import OmniScheduler | ||
|
|
||
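
The hunk above moves the `EngineCoreOutputs` import from `vllm.v1.core.sched.scheduler` to `vllm.v1.engine`. If both module layouts had to be tolerated, one option would be a small resolver that tries candidate modules in order. This is only a sketch of that idea: `resolve_symbol` is a hypothetical helper, and the PR itself simply changes the import.

```python
import importlib
from typing import Any, Sequence


def resolve_symbol(name: str, module_paths: Sequence[str]) -> Any:
    """Return the first attribute `name` found in `module_paths`, tried in order."""
    errors = []
    for module_path in module_paths:
        try:
            module = importlib.import_module(module_path)
            return getattr(module, name)
        except (ImportError, AttributeError) as exc:
            # Remember why this candidate failed, then try the next one.
            errors.append(f"{module_path}: {exc}")
    raise ImportError(f"could not resolve {name!r}; tried: " + "; ".join(errors))


# Intended use (needs vLLM installed, so it is left commented out here):
# EngineCoreOutputs = resolve_symbol(
#     "EngineCoreOutputs",
#     ["vllm.v1.engine", "vllm.v1.core.sched.scheduler"],
# )
```

A plain import, as in the diff, is simpler and is the better choice when only one vLLM version is supported.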
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,196 @@ | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # SPDX-FileCopyrightText: Copyright contributors to the vLLM project | ||
| import multiprocessing as mp | ||
| import os | ||
|
|
||
| import torch | ||
| import zmq | ||
| from vllm.config import VllmConfig, set_current_vllm_config | ||
| from vllm.distributed.device_communicators.shm_broadcast import MessageQueue | ||
| from vllm.distributed.parallel_state import ( | ||
| init_distributed_environment, | ||
| initialize_model_parallel, | ||
| ) | ||
| from vllm.logger import init_logger | ||
|
|
||
| from vllm_omni.diffusion.data import DiffusionOutput, OmniDiffusionConfig | ||
| from vllm_omni.diffusion.registry import initialize_model | ||
| from vllm_omni.diffusion.request import OmniDiffusionRequest | ||
|
|
||
| logger = init_logger(__name__) | ||
|
|
||
|
|
||
| class NPUWorker: | ||
| """ | ||
| A worker that executes the model on a single NPU. | ||
| """ | ||
|
|
||
| def __init__( | ||
| self, | ||
| local_rank: int, | ||
| rank: int, | ||
| od_config: OmniDiffusionConfig, | ||
| ): | ||
| self.local_rank = local_rank | ||
| self.rank = rank | ||
| self.od_config = od_config | ||
| self.pipeline = None | ||
|
|
||
| self.init_device_and_model() | ||
|
|
||
| def init_device_and_model(self) -> None: | ||
| """Initialize the device and load the model.""" | ||
| world_size = self.od_config.num_gpus | ||
| rank = self.rank | ||
| # Set environment variables for distributed initialization | ||
| os.environ["MASTER_ADDR"] = "localhost" | ||
| os.environ["MASTER_PORT"] = str(self.od_config.master_port) | ||
| os.environ["LOCAL_RANK"] = str(self.local_rank) | ||
| os.environ["RANK"] = str(rank) | ||
| os.environ["WORLD_SIZE"] = str(world_size) | ||
|
|
||
| device = torch.device(f"npu:{rank}") | ||
| torch.npu.set_device(device) | ||
|
|
||
| # hack | ||
| vllm_config = VllmConfig() | ||
| vllm_config.parallel_config.tensor_parallel_size = self.od_config.num_gpus | ||
| set_current_vllm_config(vllm_config) | ||
|
|
||
| init_distributed_environment(world_size=world_size, rank=rank) | ||
| initialize_model_parallel(tensor_model_parallel_size=world_size) | ||
|
|
||
| with device: | ||
| self.pipeline = initialize_model(self.od_config) | ||
| self.pipeline.load_weights() | ||
| self.pipeline.eval() | ||
| logger.info(f"Worker {self.rank}: Initialized device, model, and distributed environment.") | ||
| logger.info(f"Worker {self.rank}: Model loaded successfully.") | ||
|
|
||
| @torch.inference_mode() | ||
| def execute_model(self, reqs: list[OmniDiffusionRequest], od_config: OmniDiffusionConfig) -> DiffusionOutput: | ||
| """ | ||
| Execute a forward pass. | ||
| """ | ||
| assert self.pipeline is not None | ||
| # TODO: dealing with first req for now | ||
| req = reqs[0] | ||
| output = self.pipeline.forward(req) | ||
| return output | ||
|
|
||
|
|
||
| class NPUWorkerProc: | ||
| """Wrapper that runs one Worker in a separate process.""" | ||
|
|
||
| def __init__( | ||
| self, | ||
| od_config: OmniDiffusionConfig, | ||
| gpu_id: int, | ||
| broadcast_handle, | ||
| ): | ||
| self.od_config = od_config | ||
|
|
||
| # Inter-process Communication | ||
| self.context = zmq.Context(io_threads=2) | ||
|
|
||
| # Initialize MessageQueue reader from handle | ||
| self.mq = MessageQueue.create_from_handle(broadcast_handle, gpu_id) | ||
|
|
||
| self.result_mq = None | ||
| self.result_mq_handle = None | ||
|
|
||
| # Setup result sender (only for rank 0 for now, or whoever needs to reply) | ||
| # Assuming only rank 0 replies to scheduler as per original logic | ||
| if gpu_id == 0: | ||
| # Create MessageQueue for results (1 writer -> 1 reader) | ||
| # We assume the reader (SyncScheduler) will act as rank 0 | ||
| self.result_mq = MessageQueue(n_reader=1, n_local_reader=1, local_reader_ranks=[0]) | ||
| self.result_mq_handle = self.result_mq.export_handle() | ||
| logger.info(f"Worker {gpu_id} created result MessageQueue") | ||
|
|
||
| assert od_config.master_port is not None | ||
| worker = NPUWorker( | ||
| local_rank=gpu_id, | ||
| rank=gpu_id, | ||
| od_config=od_config, | ||
| ) | ||
| self.worker = worker | ||
| self.gpu_id = gpu_id | ||
| self._running = True | ||
|
|
||
| def return_result(self, output: DiffusionOutput): | ||
| """ | ||
| replies to client, only on rank 0 | ||
| """ | ||
| if self.result_mq is not None: | ||
| self.result_mq.enqueue(output) | ||
|
|
||
| def recv_reqs(self): | ||
| """ | ||
| Receive requests from broadcast queue | ||
| """ | ||
| return self.mq.dequeue() | ||
|
|
||
| # TODO: queueing, cancellation | ||
| def worker_busy_loop(self) -> None: | ||
| """Main busy loop for Multiprocessing Workers""" | ||
|
|
||
| logger.info(f"Worker {self.gpu_id} ready to receive requests via shared memory") | ||
|
|
||
| while self._running: | ||
| reqs = None | ||
| # 1: receive requests | ||
| try: | ||
| reqs = self.recv_reqs() | ||
| except Exception as e: | ||
| logger.error( | ||
| f"Error receiving requests in scheduler event loop: {e}", | ||
| exc_info=True, | ||
| ) | ||
| continue | ||
|
|
||
| # 2: execute, make sure a reply is always sent | ||
| try: | ||
| output = self.worker.execute_model(reqs, self.od_config) | ||
| except Exception as e: | ||
| logger.error( | ||
| f"Error executing forward in event loop: {e}", | ||
| exc_info=True, | ||
| ) | ||
| output = DiffusionOutput(error=str(e)) | ||
|
|
||
| try: | ||
| self.return_result(output) | ||
| except zmq.ZMQError as e: | ||
| # Reply failed; log and keep loop alive to accept future requests | ||
| logger.error(f"ZMQ error sending reply: {e}") | ||
| continue | ||
|
|
||
| logger.info("event loop terminated.") | ||
| # if self.result_sender is not None: | ||
| # self.result_sender.close() | ||
| self.context.term() | ||
|
|
||
| @staticmethod | ||
| def worker_main( | ||
| rank: int, | ||
| od_config: OmniDiffusionConfig, | ||
| pipe_writer: mp.connection.Connection, | ||
| broadcast_handle, | ||
| ) -> None: | ||
| """Worker initialization and execution loops.""" | ||
|
|
||
| worker_proc = NPUWorkerProc( | ||
| od_config, | ||
| gpu_id=rank, | ||
| broadcast_handle=broadcast_handle, | ||
| ) | ||
| logger.info(f"Worker {rank}: Scheduler loop started.") | ||
| pipe_writer.send( | ||
| { | ||
| "status": "ready", | ||
| "result_handle": worker_proc.result_mq_handle if rank == 0 else None, | ||
| } | ||
| ) | ||
| worker_proc.worker_busy_loop() | ||
| logger.info(f"Worker {rank}: Shutdown complete.") |
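
`worker_busy_loop` above has three steps: receive a batch, execute it, and always send a reply, turning an exception into an error output. The sketch below shows that pattern on its own, using standard-library queues in place of `MessageQueue`. `Output`, `STOP` and `busy_loop` are hypothetical stand-ins and not the PR's classes; the PR's loop is controlled by a `_running` flag rather than a sentinel.

```python
import queue
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Output:
    """Hypothetical stand-in for DiffusionOutput."""

    result: Any = None
    error: Optional[str] = None


STOP = object()  # hypothetical shutdown sentinel, not used in the PR


def busy_loop(
    requests: queue.Queue,
    results: queue.Queue,
    execute: Callable[[list], Any],
) -> None:
    """Receive, execute, and reply exactly once per request batch."""
    while True:
        reqs = requests.get()  # 1: receive requests (blocks until available)
        if reqs is STOP:
            break
        try:
            output = Output(result=execute(reqs))  # 2: execute
        except Exception as exc:
            # A failure becomes an error reply instead of killing the loop.
            output = Output(error=str(exc))
        results.put(output)  # 3: reply


if __name__ == "__main__":
    req_q, res_q = queue.Queue(), queue.Queue()
    req_q.put(["a prompt"])
    req_q.put(STOP)
    busy_loop(req_q, res_q, lambda reqs: reqs[0].upper())
    print(res_q.get_nowait())
```

Replying on failure keeps the caller from blocking forever on a result that will never arrive, which matches the "make sure a reply is always sent" comment in the diff.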