
[Main2Main] Upgrade vLLM to 0305#7005

Closed
MrZ20 wants to merge 2 commits into vllm-project:main from MrZ20:main_0305

Conversation

Contributor

@MrZ20 MrZ20 commented Mar 5, 2026

What this PR does / why we need it?

break:

Does this PR introduce any user-facing change?

How was this patch tested?

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on upgrading the vLLM integration to a newer main-branch snapshot, as indicated by the updated versioning policy. It introduces compatibility patches for torch.ops._C symbols to prevent import failures in Ascend environments and adapts the CUDAGraph dispatching mechanism to accommodate API changes in vLLM 0.16.0. It also refines the model compilation process by returning the compilation time and removes an unused linter integration.

Highlights

  • vLLM Version Update: The vLLM base version reference in the documentation was updated, aligning with the newer vLLM main commit named by the PR title (the 0305 snapshot).
  • Ascend Compatibility Patch: A new compatibility patch was introduced to handle missing torch.ops._C symbols (like rms_norm, rotary_embedding, quantization ops) in the Ascend environment, preventing import failures.
  • CUDAGraph Dispatching Adaptation: The CUDAGraph dispatching logic in ModelRunnerV1 was adapted to conditionally use different arguments based on the detected vLLM version, specifically for version 0.16.0.
  • Compilation Time Reporting: The compile_or_warm_up_model method was modified to return the compilation time, enhancing visibility into the model's setup performance.
  • Linter Integration Removal: An unused _bc_linter import and its associated decorator were removed from recompute_scheduler.py, streamlining the codebase.
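
The compatibility patch in the highlights above follows a common pattern: install inert placeholders for operator symbols that are missing on a platform, so imports succeed and a failure occurs only if the op is actually called. Below is a minimal, self-contained sketch of that pattern; all names in it (install_compat_ops, _MissingOp, the op list) are illustrative and not the actual patch_fusion_matcher_compat_ops.py API.

```python
from types import SimpleNamespace

# Ops the fusion matcher expects to find on torch.ops._C (illustrative subset)
EXPECTED_OPS = ["rms_norm", "rotary_embedding"]


class _MissingOp:
    """Placeholder that raises only if the missing op is actually invoked."""

    def __init__(self, name):
        self.name = name

    def __call__(self, *args, **kwargs):
        raise NotImplementedError(
            f"op {self.name} is not available on this platform")


def install_compat_ops(namespace, expected=EXPECTED_OPS):
    """Attach placeholders for any expected op missing from `namespace`.

    Returns the list of op names that were stubbed in.
    """
    installed = []
    for name in expected:
        if not hasattr(namespace, name):
            setattr(namespace, name, _MissingOp(name))
            installed.append(name)
    return installed


# Example: a namespace standing in for torch.ops._C where only rms_norm exists
fake_c = SimpleNamespace(rms_norm=lambda x: x)
print(install_compat_ops(fake_c))  # → ['rotary_embedding']
```

With the placeholders installed, module import no longer fails; calling a stubbed op raises a clear NotImplementedError instead.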


Changelog
  • docs/source/community/versioning_policy.md
    • Updated the vLLM commit hash for the main branch.
  • tests/ut/_310p/quantization/test_modelslim_config_310.py
    • Added moe_backend = "auto" to FusedMoEConfig in a test case.
  • tests/ut/distributed/test_communicator.py
    • Patched dist.distributed_c10d._world.pg_map with MagicMock during NPUCommunicator initialization in unit tests.
  • vllm_ascend/core/recompute_scheduler.py
    • Removed vllm._bc_linter import and @bc_linter_include decorator.
  • vllm_ascend/ops/mm_encoder_attention.py
    • Added sequence_lengths parameter to the forward_oot method.
  • vllm_ascend/patch/__init__.py
    • Documented the new patch_fusion_matcher_compat_ops.py file.
  • vllm_ascend/patch/platform/__init__.py
    • Imported the new patch_fusion_matcher_compat_ops module.
  • vllm_ascend/patch/platform/patch_fusion_matcher_compat_ops.py
    • Added a new file to define a compatibility layer for missing torch.ops._C symbols.
  • vllm_ascend/worker/model_runner_v1.py
    • Imported vllm_version_is.
    • Modified dispatch_cudagraph to use version-specific arguments for cudagraph_dispatcher.dispatch.
  • vllm_ascend/worker/worker.py
    • Changed the return type of compile_or_warm_up_model to float.
    • Returned compilation_time from compile_or_warm_up_model.
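
The worker.py entries above describe compile_or_warm_up_model now returning the elapsed compilation time as a float. A hedged sketch of that timing-and-return pattern follows; everything except the method name is hypothetical and only illustrates the shape of the change, not the actual vllm_ascend worker code.

```python
import time


class WorkerSketch:
    """Illustrative stand-in for a worker that times its warm-up step."""

    def _compile_model(self):
        # Stand-in for graph capture / model compilation work
        time.sleep(0.01)

    def compile_or_warm_up_model(self) -> float:
        start = time.perf_counter()
        self._compile_model()
        compilation_time = time.perf_counter() - start
        # Returning the elapsed seconds lets callers log setup performance
        return compilation_time


elapsed = WorkerSketch().compile_or_warm_up_model()
print(f"compilation took {elapsed:.3f}s")
```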
Ignored Files
  • Ignored by pattern: .github/workflows/** (5)
    • .github/workflows/bot_pr_create.yaml
    • .github/workflows/dockerfiles/Dockerfile.lint
    • .github/workflows/pr_test_full.yaml
    • .github/workflows/pr_test_light.yaml
    • .github/workflows/schedule_codecov_refresh.yaml
Activity
  • No human activity has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
|---------|---------|-------------|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize Gemini Code Assist for GitHub, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.



Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request upgrades the vLLM dependency and introduces several compatibility changes. The changes include updating documentation, fixing tests, and adding compatibility shims for different vLLM versions. I've found an opportunity to improve maintainability by refactoring duplicated code in vllm_ascend/worker/model_runner_v1.py.

As per the repository's style guide, here are suggestions for the pull request title and summary:

Suggested PR Title:

[Ops][Misc] Upgrade vLLM dependency

Suggested PR Summary:

### What this PR does / why we need it?
This pull request upgrades the vLLM dependency to a newer version (from commit `e2b3124...`) and introduces several changes to ensure compatibility with the updated upstream code.

Key changes include:
- Updating the vLLM commit hash in the versioning policy documentation.
- Modifying test setups to align with changes in the vLLM testing framework.
- Removing deprecated code, such as `bc_linter_include`.
- Adding a compatibility patch (`patch_fusion_matcher_compat_ops.py`) to handle missing PyTorch operators on the Ascend platform, preventing import-time errors.
- Introducing version-specific logic in `NPUModelRunner` to handle API differences in `cudagraph_dispatcher` between vLLM versions.

These changes are necessary to keep `vllm-ascend` in sync with the latest developments in the core vLLM repository.

### Does this PR introduce _any_ user-facing change?
No, this PR primarily consists of internal dependency upgrades and compatibility fixes. There are no user-facing API or behavior changes.

### How was this patch tested?
CI should pass. The changes include updates to unit tests to ensure they pass with the new vLLM version.

Comment thread vllm_ascend/worker/model_runner_v1.py Outdated
Comment on lines +1829 to +1851
```python
if vllm_version_is("0.16.0"):

    def dispatch_cudagraph(num_tokens, disable_full=False, valid_modes=None):
        if force_eager:
            return (CUDAGraphMode.NONE, BatchDescriptor(num_tokens_padded))
        return self.cudagraph_dispatcher.dispatch(
            num_tokens=num_tokens,
            has_lora=has_lora,
            uniform_decode=uniform_decode,
            disable_full=disable_full,
        )
else:

    def dispatch_cudagraph(num_tokens, disable_full=False, valid_modes=None):
        if force_eager:
            return (CUDAGraphMode.NONE, BatchDescriptor(num_tokens_padded))
        return self.cudagraph_dispatcher.dispatch(
            num_tokens=num_tokens,
            has_lora=has_lora,
            uniform_decode=uniform_decode,
            valid_modes=valid_modes,
            invalid_modes={CUDAGraphMode.FULL} if disable_full else None,
        )
```
Contributor


Severity: high

There is significant code duplication in the dispatch_cudagraph function definition for the two vLLM version branches. This makes the code harder to maintain and increases the risk of introducing bugs if one branch is modified and the other is not. This can be refactored to define the function once and handle the version-specific logic inside.

```python
def dispatch_cudagraph(num_tokens, disable_full=False, valid_modes=None):
    if force_eager:
        return (CUDAGraphMode.NONE, BatchDescriptor(num_tokens_padded))

    common_args = {
        "num_tokens": num_tokens,
        "has_lora": has_lora,
        "uniform_decode": uniform_decode,
    }
    if vllm_version_is("0.16.0"):
        return self.cudagraph_dispatcher.dispatch(
            **common_args,
            disable_full=disable_full,
        )
    else:
        return self.cudagraph_dispatcher.dispatch(
            **common_args,
            valid_modes=valid_modes,
            invalid_modes={CUDAGraphMode.FULL} if disable_full else None,
        )
```

@Potabk Potabk added ready read for review ready-for-test start test by label for PR labels Mar 5, 2026
@github-actions github-actions bot added documentation Improvements or additions to documentation module:tests module:ops labels Mar 5, 2026
@github-actions
Contributor

github-actions bot commented Mar 5, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Fill in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@github-actions
Contributor

github-actions bot commented Mar 6, 2026

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: MrZ20 <2609716663@qq.com>
@jikunshang

May I know why and how empty_cache breaks on the Ascend side?

@gcanlin
Collaborator

gcanlin commented Mar 9, 2026

> May I know why and how empty_cache breaks on the Ascend side?

Thanks for the attention! We found that torch.accelerator.empty_cache does not seem to be ready in torch_npu v2.9.0, and the CI hit the same error before. cc @Yikun @wangxiyuan @MengqingCao @fffrog

We may need to check whether it's ready in torch_npu v2.10.0 and upgrade the torch version accordingly. Until then, our main2main sync will keep breaking.

```
>>> import torch
>>> import torch_npu
/root/vllm-workspace2/.venv/lib/python3.11/site-packages/torch_npu/__init__.py:309: UserWarning: On the interactive interface, the value of TASK_QUEUE_ENABLE is set to 0 by default. Do not set it to 1 to prevent some unknown errors
  warnings.warn("On the interactive interface, the value of TASK_QUEUE_ENABLE is set to 0 by default. \
>>> torch.accelerator.empty_cache()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/vllm-workspace2/.venv/lib/python3.11/site-packages/torch/accelerator/memory.py", line 28, in empty_cache
    if not torch._C._accelerator_isAllocatorInitialized():
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: device_allocator INTERNAL ASSERT FAILED at "/pytorch/c10/core/CachingDeviceAllocator.h":109, please report a bug to PyTorch. Allocator for npu is not a DeviceAllocator.
```
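
One defensive option while the allocator issue stands, sketched below as an illustration rather than the fix adopted here: wrap the call and treat a missing API or the assertion above as "cache not cleared" instead of a hard failure. The function name safe_empty_cache is hypothetical.

```python
def safe_empty_cache() -> bool:
    """Best-effort cache release; returns True only if empty_cache ran."""
    try:
        import torch
        torch.accelerator.empty_cache()
        return True
    except (ImportError, AttributeError, RuntimeError):
        # torch not installed, torch.accelerator absent (older torch),
        # or a backend assertion like the torch_npu one in the traceback
        return False


print(safe_empty_cache())
```

Whether swallowing the RuntimeError is acceptable depends on the caller; for a warm-up path it may be preferable to crashing at import or startup.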

@jikunshang

sorry to hear that. I thought that empty_cache should work on torch2.9...

@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@MrZ20 MrZ20 closed this Mar 13, 2026
@MrZ20 MrZ20 deleted the main_0305 branch April 8, 2026 01:25

Labels

  • documentation (Improvements or additions to documentation)
  • merge-conflicts
  • module:ops
  • module:tests
  • ready (read for review)
  • ready-for-test (start test by label for PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants