Main2main upgrade to vllm 0317 afternoon #7409
Conversation
Signed-off-by: leo-pony <nengjunma@outlook.com>
Root causes:
- CompilationConfig.compile_ranges_split_points renamed to compile_ranges_endpoints (4b87ffb)
- torch.accelerator.memory_stats/reserved not supported on NPU (747b068)
- get_attn_backend() removed the block_size parameter (77a7345)

Upstream commit range: 4034c3d..43a73f8

Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-Authored-By: Claude Code <noreply@anthropic.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
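As context for the second root cause (torch.accelerator memory queries unavailable on NPU), here is a minimal sketch of the kind of redirect a platform plugin could apply. It is an assumption, not the actual patch in this PR: the function name is illustrative, and whether torch.npu exposes exact drop-in equivalents is assumed rather than verified here.

```python
# Hedged sketch: point the torch.accelerator memory queries that newer vLLM
# calls at their torch.npu counterparts, assuming torch_npu has already
# registered the NPU backend so torch.npu exists.
import torch


def patch_accelerator_memory_apis() -> None:
    if not hasattr(torch, "accelerator") or not hasattr(torch, "npu"):
        return  # nothing to patch, or the NPU backend is not loaded
    torch.accelerator.memory_stats = torch.npu.memory_stats
    torch.accelerator.memory_reserved = torch.npu.memory_reserved
```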
Signed-off-by: leo-pony <nengjunma@outlook.com>
- Restore use_sparse_c8_indexer initialization in NPUModelRunner that was dropped during rebase
- Guard deepstack_num_level, mrope_section, mrope_interleaved with hasattr checks, since the xlite C++ ModelConfig may not have these attrs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
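A minimal sketch of the guarded-attribute pattern this commit describes; `model_config` and the fallback defaults are illustrative assumptions, not the runner's exact code.

```python
# Hedged sketch: read the optional fields defensively, since the xlite C++
# ModelConfig may not define them at all.
def read_optional_rope_fields(model_config):
    deepstack_num_level = getattr(model_config, "deepstack_num_level", None)
    mrope_section = getattr(model_config, "mrope_section", None)
    mrope_interleaved = getattr(model_config, "mrope_interleaved", False)
    return deepstack_num_level, mrope_section, mrope_interleaved
```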
Signed-off-by: leo-pony <nengjunma@outlook.com>
…le3 refactor

Upstream vLLM commit 8b34630 (Consolidate SupportsEagle #36063) renamed get_eagle3_aux_hidden_state_layers() to get_eagle3_default_aux_hidden_state_layers() and added a supports_eagle3() guard before calling it. Update model_runner_v1.py to match upstream: add the supports_eagle3 check and use the new method name to fix the AttributeError on Qwen3ForCausalLM.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
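A minimal sketch of the upstream-aligned call site described above; the import path for `supports_eagle3` is an assumption based on where vLLM keeps similar interface helpers, and `model` stands in for the loaded model instance.

```python
from vllm.model_executor.models.interfaces import supports_eagle3  # path assumed


def eagle3_aux_hidden_state_layers(model):
    # Only Eagle3-capable models expose the renamed helper; calling it
    # unconditionally raised AttributeError on Qwen3ForCausalLM.
    if supports_eagle3(model):
        return model.get_eagle3_default_aux_hidden_state_layers()
    return ()
```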
Upstream vLLM commit cfaf466 (Support multiple KV groups in OffloadingSpec #36610) removed self.offloaded_block_size and changed self.gpu_block_size from a scalar to a tuple of per-group block sizes, adding block_size_factor. Update NPUOffloadingSpec.get_manager() and get_handlers() to match the new API: extract gpu_block_size[0] and compute offloaded_block_size via gpu_block_size * block_size_factor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
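A minimal sketch of the block-size arithmetic this commit adapts to; `spec` stands in for an NPUOffloadingSpec instance and the attribute names are taken from the commit message, so treat this as illustrative rather than the actual method bodies.

```python
def derive_block_sizes(spec):
    # gpu_block_size is now a tuple of per-KV-group block sizes; take the
    # first group and derive the offloaded block size via block_size_factor.
    gpu_block_size = spec.gpu_block_size[0]
    offloaded_block_size = gpu_block_size * spec.block_size_factor
    return gpu_block_size, offloaded_block_size
```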
The sparse_head_dim tuple (kv_lora_rank, qk_rope_head_dim, index_head_dim) was dropped during rebase but is required by get_kv_cache_spec() when use_sparse is True (DSv3.1 sparse MLA models).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
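A minimal sketch of the restored tuple, assuming the three dimensions come from the model's MLA config; the accessor shape is illustrative.

```python
def sparse_head_dim(config, use_sparse: bool):
    # get_kv_cache_spec() needs this triple when sparse MLA (DSv3.1) is active.
    if not use_sparse:
        return None
    return (config.kv_lora_rank, config.qk_rope_head_dim, config.index_head_dim)
```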
Signed-off-by: leo-pony <nengjunma@outlook.com>
…0 handle

Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Summary of Changes: Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly.
Code Review
This pull request upgrades vLLM compatibility by introducing version checks and conditional logic to handle API differences. The changes are mostly correct, but I've identified a critical issue where a safety check was removed, potentially causing an AttributeError. I've also pointed out several instances of code duplication that could be refactored to improve maintainability.
Additionally, the pull request title and description do not follow the repository's style guide. I suggest updating them to improve clarity and consistency.
Suggested PR Title:
[main][Misc][Upgrade] Upgrade vLLM compatibility
Suggested PR Summary:
### What this PR does / why we need it?
This PR updates the codebase to be compatible with a newer version of vLLM (commit `8a680463fab3bc9e6760417cd5c0a6aa58283065`). The changes primarily involve:
- Adding version checks and conditional logic to handle API differences in `ascend_config.py`, `kv_offload/npu.py`, and `worker/model_runner_v1.py`.
- Monkey-patching `torch.accelerator` in `platform.py` for NPU compatibility.
- Updating documentation and commit hashes.
- Temporarily skipping a failing test in `test_disaggregated_encoder.py`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI will be used to test the changes.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
1. Fix `TypeError` from `get_attn_backend()` after an argument was removed: [Refactor `check_and_update_config`](vllm-project/vllm#35122)
2. Adapt to [Rename `compile_ranges_split_points` to `compile_ranges_endpoints`](vllm-project/vllm#36027) (a minimal sketch follows this description)
3. Fix "RuntimeError: device_allocator not a DeviceAllocator": [Replace memory-related torch.cuda APIs](vllm-project/vllm#37031)
4. Adapt to [Support multiple KV groups in OffloadingSpec](vllm-project/vllm#36610), which removed `self.offloaded_block_size` and changed `self.gpu_block_size` from a scalar to a tuple of per-group block sizes, adding `block_size_factor`.
5. Adapt to [Consolidate SupportsEagle](vllm-project/vllm#36063), which renamed `get_eagle3_aux_hidden_state_layers()` to `get_eagle3_default_aux_hidden_state_layers()` and added a `supports_eagle3()` guard before calling it.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
E2E

- vLLM version: v0.17.0
- vLLM main: vllm-project/vllm@8a68046

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
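As a companion to item 2 above (the compile-range field rename), a minimal sketch of a version-tolerant accessor; the helper name is hypothetical, and only the two attribute names quoted in the description are assumed.

```python
def compile_range_points(compilation_config):
    # Newer vLLM exposes compile_ranges_endpoints; older builds used
    # compile_ranges_split_points. Prefer the new name, fall back to the old.
    if hasattr(compilation_config, "compile_ranges_endpoints"):
        return compilation_config.compile_ranges_endpoints
    return getattr(compilation_config, "compile_ranges_split_points", None)
```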
What this PR does / why we need it?
1. Fix `TypeError` from `get_attn_backend()` after an argument was removed: Refactor `check_and_update_config` (see the sketch after this list)
2. Adapt to Rename `compile_ranges_split_points` to `compile_ranges_endpoints`
3. Fix "RuntimeError: device_allocator not a DeviceAllocator": Replace memory-related torch.cuda APIs
4. Adapt to Support multiple KV groups in OffloadingSpec, which removed `self.offloaded_block_size` and changed `self.gpu_block_size` from a scalar to a tuple of per-group block sizes, adding `block_size_factor`.
5. Adapt to Consolidate SupportsEagle, which renamed `get_eagle3_aux_hidden_state_layers()` to `get_eagle3_default_aux_hidden_state_layers()` and added a `supports_eagle3()` guard before calling it.
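As referenced in item 1, a minimal sketch of a signature-tolerant way to build the call arguments; `get_attn_backend` is assumed importable from `vllm.attention`, and the caller is assumed to supply the remaining required keywords.

```python
import inspect

from vllm.attention import get_attn_backend  # import path assumed


def attn_backend_kwargs(block_size, **kwargs):
    # Newer vLLM dropped block_size from get_attn_backend(); only forward it
    # when the installed version still accepts the parameter.
    if "block_size" in inspect.signature(get_attn_backend).parameters:
        kwargs["block_size"] = block_size
    return kwargs
```

The caller would then spread the returned kwargs into the real `get_attn_backend()` call alongside its other arguments, so the same call site works on both sides of the upstream change.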
Does this PR introduce any user-facing change?
NA
How was this patch tested?
E2E