
[Main2Main] Upgrade vLLM to 0305 #7099

Closed
menogrey wants to merge 17 commits into vllm-project:main from menogrey:main2main

Conversation

@menogrey
Collaborator

@menogrey menogrey commented Mar 10, 2026

@github-actions github-actions bot added the documentation (Improvements or additions to documentation) and ci/build labels on Mar 10, 2026
@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message and fill out the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on upgrading the vLLM dependency to version 0.3.5 and adapting the Ascend-specific components to maintain compatibility. Key changes involve refining memory management for NPU devices by introducing a dedicated empty_cache patch and an NPU-aware memory profiling context manager. Additionally, the profiling process was optimized to avoid compilation issues specific to Ascend, ensuring more accurate memory estimations during model setup.

Highlights

  • vLLM Upgrade: The project has been upgraded to vLLM version 0.3.5, with the corresponding commit hash updated in the versioning policy documentation.
  • NPU-Specific Memory Management: A new patch was introduced to redirect torch.accelerator.empty_cache calls to torch.npu.empty_cache, ensuring proper memory handling on Ascend devices (a minimal sketch follows this list).
  • Profiling Optimization: Compilation and CUDAGraph modes are now temporarily disabled during profiling runs in ModelRunnerV1 to prevent FakeTensorMode mismatches and improve memory estimation accuracy on Ascend.
  • Dedicated NPU Memory Profiling: A new npu_memory_profiling context manager was added to NPUWorker to provide NPU-safe memory profiling, replacing the generic memory_profiling for Ascend-specific operations.
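To make the empty_cache redirection in the first highlight concrete, here is a minimal sketch of what such a patch can look like. It only assumes a PyTorch version that exposes torch.accelerator and an Ascend host where torch_npu is importable; the helper name _npu_empty_cache is illustrative and not taken from this PR.

```python
# Sketch: route the accelerator-agnostic cache-clearing call that upstream
# vLLM issues to the Ascend backend. Illustrative only, not the actual patch.
import torch
import torch_npu  # noqa: F401  # importing registers the torch.npu namespace


def _npu_empty_cache() -> None:
    # Release cached NPU memory blocks back to the device allocator.
    torch.npu.empty_cache()


# Upstream vLLM calls torch.accelerator.empty_cache(); point it at the NPU.
torch.accelerator.empty_cache = _npu_empty_cache
```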


Changelog
  • docs/source/community/versioning_policy.md
    • Updated the vLLM commit hash reference to reflect the upgrade to 0.3.5.
  • vllm_ascend/patch/platform/__init__.py
    • Imported the newly added patch_torch_accelerator module.
  • vllm_ascend/patch/platform/patch_torch_accelerator.py
    • Added a new file to patch torch.accelerator.empty_cache to use torch.npu.empty_cache.
  • vllm_ascend/worker/model_runner_v1.py
    • Modified the _model_forward call to include a skip_compiled flag for profiling.
    • Temporarily disabled CompilationMode and CUDAGraphMode during profile_run to prevent FakeTensorMode mismatches on Ascend.
  • vllm_ascend/worker/worker.py
    • Imported contextlib and Generator for the new context manager.
    • Updated the import for memory_profiling to MemoryProfilingResult.
    • Added a new npu_memory_profiling context manager for NPU-specific memory profiling, replacing the generic memory_profiling for Ascend-specific operations (a sketch of such a context manager follows this list).
    • Switched the determine_available_memory method to use the new npu_memory_profiling context manager.
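A context manager with the shape described above might look like the following sketch. The body, the result dict, and the mem_get_info-based accounting are illustrative assumptions; the actual implementation lives in vllm_ascend/worker/worker.py.

```python
# Sketch of an NPU-safe memory profiling scope (illustrative, not the PR code).
import contextlib
import gc
from collections.abc import Generator

import torch
import torch_npu  # noqa: F401


@contextlib.contextmanager
def npu_memory_profiling(result: dict) -> Generator[None, None, None]:
    # Snapshot free NPU memory before the wrapped block, then record how much
    # was consumed once the block finishes, using torch.npu calls instead of
    # the generic torch.accelerator APIs that torch-npu does not support.
    gc.collect()
    torch.npu.empty_cache()
    free_before, total = torch.npu.mem_get_info()
    try:
        yield
    finally:
        torch.npu.synchronize()
        gc.collect()
        torch.npu.empty_cache()
        free_after, _ = torch.npu.mem_get_info()
        result["total_bytes"] = total
        result["consumed_bytes"] = free_before - free_after
```

In such a sketch, determine_available_memory could wrap its profiling forward pass in this scope and read consumed_bytes afterwards to size the KV cache budget.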
Ignored Files
  • Ignored by pattern: .github/workflows/** (6)
    • .github/workflows/bot_pr_create.yaml
    • .github/workflows/dockerfiles/Dockerfile.lint
    • .github/workflows/pr_test_full.yaml
    • .github/workflows/pr_test_light.yaml
    • .github/workflows/schedule_codecov_refresh.yaml
    • .github/workflows/schedule_update_estimated_time.yaml
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize Gemini Code Assist for your GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any comments that are incorrect or counterproductive; you can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@menogrey menogrey added the ready (ready for review) and ready-for-test (start test by label for PR) labels and removed the documentation (Improvements or additions to documentation) and ci/build labels on Mar 10, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request upgrades the vLLM dependency and adapts the codebase to the changes. The main adaptations involve providing NPU-specific workarounds for memory profiling and disabling compilation during profiling runs to prevent errors on Ascend. The changes appear correct and robust.

I have provided suggestions for the pull request title and summary to align with the repository's style guide. Please see below.

Suggested PR Title:

[Main2Main][Core][Misc] Upgrade vLLM to commit 5b3ba94

Suggested PR Summary:

### What this PR does / why we need it?

This PR upgrades the vLLM dependency to commit `5b3ba94ab4bd9da739bcc27cdd05505467fa499e` to keep up with the latest upstream changes.

This upgrade introduces changes that require adaptation for the Ascend backend. Specifically:
- Upstream vLLM now uses `torch.accelerator.empty_cache()`, which is not supported on `torch-npu`. This PR introduces a patch to redirect this call to `torch.npu.empty_cache()` and also provides an NPU-specific memory profiling utility to avoid issues during memory profiling.
- Changes in upstream profiling logic could lead to `FakeTensorMode` mismatches on Ascend. This PR disables compilation and graph modes during profiling runs to prevent this.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI passed with existing tests.
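The second bullet of the suggested summary above (disabling compilation and graph modes during profiling runs) usually reduces to saving and restoring a couple of config fields around the profiling pass. The attribute names below are illustrative assumptions, not vLLM's exact config schema:

```python
# Sketch: force eager execution for a profiling run, then restore the original
# compilation and graph-capture settings. Attribute names are placeholders.
import contextlib


@contextlib.contextmanager
def eager_profiling(compilation_config):
    saved_mode = compilation_config.mode
    saved_graph_mode = compilation_config.cudagraph_mode
    try:
        compilation_config.mode = None            # skip torch.compile
        compilation_config.cudagraph_mode = None  # skip graph capture
        yield
    finally:
        compilation_config.mode = saved_mode
        compilation_config.cudagraph_mode = saved_graph_mode
```

In this sketch, profile_run could wrap its dummy forward pass in such a guard so FakeTensorMode never meets compiled artifacts, while normal execution is unaffected because the original settings are restored on exit.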

@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

menogrey and others added 5 commits March 16, 2026 10:01
Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
menogrey added 10 commits March 16, 2026 10:02
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@menogrey
Collaborator Author

Out of date, closing.

@menogrey menogrey closed this Apr 16, 2026

Labels

merge-conflicts, ready (ready for review), ready-for-test (start test by label for PR)


2 participants