
[Main2Main] Upgrade vLLM to 0305#7005

Closed
MrZ20 wants to merge 2 commits into vllm-project:main from MrZ20:main_0305

Conversation

Contributor

@MrZ20 MrZ20 commented Mar 5, 2026

What this PR does / why we need it?

break:

Does this PR introduce any user-facing change?

How was this patch tested?

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on upgrading the vLLM integration to a newer main-branch snapshot, as indicated by the updated versioning policy. It introduces compatibility patches for torch.ops._C symbols to prevent import failures in Ascend environments and adapts the CUDAGraph dispatching mechanism to accommodate API changes in vLLM 0.16.0. It also refines the model compilation process by returning the compilation time and removes an unused linter integration.

Highlights

  • vLLM Version Update: The vLLM base version reference in the documentation was updated, aligning with the newer vLLM main commit named by the PR title (the 0305 snapshot).
  • Ascend Compatibility Patch: A new compatibility patch was introduced to handle missing torch.ops._C symbols (like rms_norm, rotary_embedding, quantization ops) in the Ascend environment, preventing import failures.
  • CUDAGraph Dispatching Adaptation: The CUDAGraph dispatching logic in ModelRunnerV1 was adapted to conditionally use different arguments based on the detected vLLM version, specifically for version 0.16.0.
  • Compilation Time Reporting: The compile_or_warm_up_model method was modified to return the compilation time, enhancing visibility into the model's setup performance.
  • Linter Integration Removal: An unused _bc_linter import and its associated decorator were removed from recompute_scheduler.py, streamlining the codebase.
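
The compatibility patch in the highlights above follows a common pattern: install inert placeholders for operator symbols that are missing on a platform, so imports succeed and a failure occurs only if the op is actually called. Below is a minimal, self-contained sketch of that pattern; all names in it (install_compat_ops, _MissingOp, the op list) are illustrative and not the actual patch_fusion_matcher_compat_ops.py API.

```python
from types import SimpleNamespace

# Ops the fusion matcher expects to find on torch.ops._C (illustrative subset)
EXPECTED_OPS = ["rms_norm", "rotary_embedding"]


class _MissingOp:
    """Placeholder that raises only if the missing op is actually invoked."""

    def __init__(self, name):
        self.name = name

    def __call__(self, *args, **kwargs):
        raise NotImplementedError(
            f"op {self.name} is not available on this platform")


def install_compat_ops(namespace, expected=EXPECTED_OPS):
    """Attach placeholders for any expected op missing from `namespace`.

    Returns the list of op names that were stubbed in.
    """
    installed = []
    for name in expected:
        if not hasattr(namespace, name):
            setattr(namespace, name, _MissingOp(name))
            installed.append(name)
    return installed


# Example: a namespace standing in for torch.ops._C where only rms_norm exists
fake_c = SimpleNamespace(rms_norm=lambda x: x)
print(install_compat_ops(fake_c))  # → ['rotary_embedding']
```

With the placeholders installed, module import no longer fails; calling a stubbed op raises a clear NotImplementedError instead.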


Changelog
  • docs/source/community/versioning_policy.md
    • Updated the vLLM commit hash for the main branch.
  • tests/ut/_310p/quantization/test_modelslim_config_310.py
    • Added moe_backend = "auto" to FusedMoEConfig in a test case.
  • tests/ut/distributed/test_communicator.py
    • Patched dist.distributed_c10d._world.pg_map with MagicMock during NPUCommunicator initialization in unit tests.
  • vllm_ascend/core/recompute_scheduler.py
    • Removed vllm._bc_linter import and @bc_linter_include decorator.
  • vllm_ascend/ops/mm_encoder_attention.py
    • Added sequence_lengths parameter to the forward_oot method.
  • vllm_ascend/patch/__init__.py
    • Documented the new patch_fusion_matcher_compat_ops.py file.
  • vllm_ascend/patch/platform/__init__.py
    • Imported the new patch_fusion_matcher_compat_ops module.
  • vllm_ascend/patch/platform/patch_fusion_matcher_compat_ops.py
    • Added a new file to define a compatibility layer for missing torch.ops._C symbols.
  • vllm_ascend/worker/model_runner_v1.py
    • Imported vllm_version_is.
    • Modified dispatch_cudagraph to use version-specific arguments for cudagraph_dispatcher.dispatch.
  • vllm_ascend/worker/worker.py
    • Changed the return type of compile_or_warm_up_model to float.
    • Returned compilation_time from compile_or_warm_up_model.
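
The worker.py entries above describe compile_or_warm_up_model now returning the elapsed compilation time as a float. A hedged sketch of that timing-and-return pattern follows; everything except the method name is hypothetical and only illustrates the shape of the change, not the actual vllm_ascend worker code.

```python
import time


class WorkerSketch:
    """Illustrative stand-in for a worker that times its warm-up step."""

    def _compile_model(self):
        # Stand-in for graph capture / model compilation work
        time.sleep(0.01)

    def compile_or_warm_up_model(self) -> float:
        start = time.perf_counter()
        self._compile_model()
        compilation_time = time.perf_counter() - start
        # Returning the elapsed seconds lets callers log setup performance
        return compilation_time


elapsed = WorkerSketch().compile_or_warm_up_model()
print(f"compilation took {elapsed:.3f}s")
```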
Ignored Files
  • Ignored by pattern: .github/workflows/** (5)
    • .github/workflows/bot_pr_create.yaml
    • .github/workflows/dockerfiles/Dockerfile.lint
    • .github/workflows/pr_test_full.yaml
    • .github/workflows/pr_test_light.yaml
    • .github/workflows/schedule_codecov_refresh.yaml
Activity
  • No human activity has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
|---------|---------|-------------|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize Gemini Code Assist for GitHub, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.



Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request upgrades the vLLM dependency and introduces several compatibility changes. The changes include updating documentation, fixing tests, and adding compatibility shims for different vLLM versions. I've found an opportunity to improve maintainability by refactoring duplicated code in vllm_ascend/worker/model_runner_v1.py.

As per the repository's style guide, here are suggestions for the pull request title and summary:

Suggested PR Title:

[Ops][Misc] Upgrade vLLM dependency

Suggested PR Summary:

### What this PR does / why we need it?
This pull request upgrades the vLLM dependency to a newer version (from commit `e2b3124...`) and introduces several changes to ensure compatibility with the updated upstream code.

Key changes include:
- Updating the vLLM commit hash in the versioning policy documentation.
- Modifying test setups to align with changes in the vLLM testing framework.
- Removing deprecated code, such as `bc_linter_include`.
- Adding a compatibility patch (`patch_fusion_matcher_compat_ops.py`) to handle missing PyTorch operators on the Ascend platform, preventing import-time errors.
- Introducing version-specific logic in `NPUModelRunner` to handle API differences in `cudagraph_dispatcher` between vLLM versions.

These changes are necessary to keep `vllm-ascend` in sync with the latest developments in the core vLLM repository.

### Does this PR introduce _any_ user-facing change?
No, this PR primarily consists of internal dependency upgrades and compatibility fixes. There are no user-facing API or behavior changes.

### How was this patch tested?
CI should pass. The changes include updates to unit tests to ensure they pass with the new vLLM version.

Comment thread vllm_ascend/worker/model_runner_v1.py Outdated
Comment on lines +1829 to +1851
```python
if vllm_version_is("0.16.0"):

    def dispatch_cudagraph(num_tokens, disable_full=False, valid_modes=None):
        if force_eager:
            return (CUDAGraphMode.NONE, BatchDescriptor(num_tokens_padded))
        return self.cudagraph_dispatcher.dispatch(
            num_tokens=num_tokens,
            has_lora=has_lora,
            uniform_decode=uniform_decode,
            disable_full=disable_full,
        )
else:

    def dispatch_cudagraph(num_tokens, disable_full=False, valid_modes=None):
        if force_eager:
            return (CUDAGraphMode.NONE, BatchDescriptor(num_tokens_padded))
        return self.cudagraph_dispatcher.dispatch(
            num_tokens=num_tokens,
            has_lora=has_lora,
            uniform_decode=uniform_decode,
            valid_modes=valid_modes,
            invalid_modes={CUDAGraphMode.FULL} if disable_full else None,
        )
```
Contributor


Severity: high

There is significant code duplication in the dispatch_cudagraph function definition for the two vLLM version branches. This makes the code harder to maintain and increases the risk of introducing bugs if one branch is modified and the other is not. This can be refactored to define the function once and handle the version-specific logic inside.

```python
def dispatch_cudagraph(num_tokens, disable_full=False, valid_modes=None):
    if force_eager:
        return (CUDAGraphMode.NONE, BatchDescriptor(num_tokens_padded))

    common_args = {
        "num_tokens": num_tokens,
        "has_lora": has_lora,
        "uniform_decode": uniform_decode,
    }
    if vllm_version_is("0.16.0"):
        return self.cudagraph_dispatcher.dispatch(
            **common_args,
            disable_full=disable_full,
        )
    else:
        return self.cudagraph_dispatcher.dispatch(
            **common_args,
            valid_modes=valid_modes,
            invalid_modes={CUDAGraphMode.FULL} if disable_full else None,
        )
```

@Potabk Potabk added ready read for review ready-for-test start test by label for PR labels Mar 5, 2026
@github-actions github-actions bot added documentation Improvements or additions to documentation module:tests module:ops labels Mar 5, 2026
@github-actions
Contributor

github-actions bot commented Mar 5, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Fill in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@github-actions
Contributor

github-actions bot commented Mar 6, 2026

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: MrZ20 <2609716663@qq.com>
@jikunshang

May I know why and how empty_cache breaks on the Ascend side?

@gcanlin
Collaborator

gcanlin commented Mar 9, 2026

> May I know why and how empty_cache breaks on the Ascend side?

Thanks for the attention! We found that torch.accelerator.empty_cache does not seem to be ready in torch_npu v2.9.0, and the CI hit the same error before. cc @Yikun @wangxiyuan @MengqingCao @fffrog

We may need to check whether it's ready in torch_npu v2.10.0 and upgrade the torch version accordingly. Until then, our main2main sync will keep breaking.

```
>>> import torch
>>> import torch_npu
/root/vllm-workspace2/.venv/lib/python3.11/site-packages/torch_npu/__init__.py:309: UserWarning: On the interactive interface, the value of TASK_QUEUE_ENABLE is set to 0 by default. Do not set it to 1 to prevent some unknown errors
  warnings.warn("On the interactive interface, the value of TASK_QUEUE_ENABLE is set to 0 by default. \
>>> torch.accelerator.empty_cache()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/vllm-workspace2/.venv/lib/python3.11/site-packages/torch/accelerator/memory.py", line 28, in empty_cache
    if not torch._C._accelerator_isAllocatorInitialized():
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: device_allocator INTERNAL ASSERT FAILED at "/pytorch/c10/core/CachingDeviceAllocator.h":109, please report a bug to PyTorch. Allocator for npu is not a DeviceAllocator.
```
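
One defensive option while the allocator issue stands, sketched below as an illustration rather than the fix adopted here: wrap the call and treat a missing API or the assertion above as "cache not cleared" instead of a hard failure. The function name safe_empty_cache is hypothetical.

```python
def safe_empty_cache() -> bool:
    """Best-effort cache release; returns True only if empty_cache ran."""
    try:
        import torch
        torch.accelerator.empty_cache()
        return True
    except (ImportError, AttributeError, RuntimeError):
        # torch not installed, torch.accelerator absent (older torch),
        # or a backend assertion like the torch_npu one in the traceback
        return False


print(safe_empty_cache())
```

Whether swallowing the RuntimeError is acceptable depends on the caller; for a warm-up path it may be preferable to crashing at import or startup.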

@jikunshang

sorry to hear that. I thought that empty_cache should work on torch2.9...

@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@MrZ20 MrZ20 closed this Mar 13, 2026
@MrZ20 MrZ20 deleted the main_0305 branch April 8, 2026 01:25

Labels

  • documentation (Improvements or additions to documentation)
  • merge-conflicts
  • module:ops
  • module:tests
  • ready (read for review)
  • ready-for-test (start test by label for PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants