Main2main upgrade to vllm 0317 afternoon #7409

Merged

wangxiyuan merged 15 commits into vllm-project:main from leo-pony:main2main_0317 on Mar 18, 2026
Conversation

@leo-pony
Collaborator

@leo-pony leo-pony commented Mar 18, 2026

What this PR does / why we need it?

1.fix "TypeError: get_attn_backend() remove variable": Refactor check_and_update_config

2.fix Rename compile_ranges_split_points to compile_ranges_endpoints

3.fix "RuntimeError: device_allocator not a DeviceAllocator":Replace memory related torch.cuda APIs"

4.fix Support multiple KV groups in OffloadingSpec
removed self.offloaded_block_size and changed self.gpu_block_size from a scalar to a tuple of per-group block sizes, adding block_size_factor.

5.fix Consolidate SupportsEagle renamed
get_eagle3_aux_hidden_state_layers() to get_eagle3_default_aux_hidden_state_layers() and added a supports_eagle3() guard before calling it.
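
Below is a minimal sketch of the guard from item 5, assuming the `supports_eagle3()` helper and the renamed method introduced by the upstream change; the import path and the surrounding function are illustrative, not the actual model_runner_v1.py code.

```python
# Hedged sketch, not the actual vllm-ascend code: guard the aux-hidden-state
# lookup behind supports_eagle3() and use the renamed method.
# The import path below is an assumption based on the upstream refactor.
from vllm.model_executor.models.interfaces import supports_eagle3


def resolve_aux_hidden_state_layers(model):
    """Return the default EAGLE3 aux hidden-state layers, or an empty tuple."""
    if supports_eagle3(model):
        # Renamed upstream from get_eagle3_aux_hidden_state_layers().
        return model.get_eagle3_default_aux_hidden_state_layers()
    # Models without EAGLE3 support (e.g. plain Qwen3ForCausalLM) no longer
    # trigger an AttributeError here.
    return ()
```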

Does this PR introduce any user-facing change?

NA

How was this patch tested?

E2E

- vLLM version: v0.17.0
- vLLM main: vllm-project/vllm@8a68046

leo-pony and others added 11 commits March 18, 2026 02:55
Signed-off-by: leo-pony <nengjunma@outlook.com>
Root causes:
- CompilationConfig.compile_ranges_split_points renamed to compile_ranges_endpoints (4b87ffb); an adaptation sketch follows this commit message
- torch.accelerator.memory_stats/reserved not supported on NPU (747b068)
- get_attn_backend() removed block_size parameter (77a7345)

Upstream commit range: 4034c3d..43a73f8
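
A hedged sketch of one way to absorb the first root cause (the compile-range rename); the helper name is an assumption, and the real ascend_config.py change may key off the installed vLLM version rather than hasattr.

```python
# Hedged sketch, not the actual ascend_config.py change: support both the old
# and the new field name for the compile-range boundaries.
def get_compile_range_points(compilation_config):
    if hasattr(compilation_config, "compile_ranges_endpoints"):
        # vLLM after the rename (upstream commit 4b87ffb).
        return compilation_config.compile_ranges_endpoints
    # Older vLLM still exposes the previous name.
    return compilation_config.compile_ranges_split_points
```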

Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-Authored-By: Claude Code <noreply@anthropic.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
- Restore use_sparse_c8_indexer initialization in NPUModelRunner that
  was dropped during rebase
- Guard deepstack_num_level, mrope_section, mrope_interleaved with
  hasattr checks since the xlite C++ ModelConfig may not have these
  attributes (a sketch follows this commit message)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
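
A minimal sketch of the hasattr guarding described in the second bullet above; the function name and the idea of copying values from an HF-style config are illustrative assumptions, not the actual xlite.py code.

```python
# Hedged sketch, not the actual xlite.py change: only set fields that the
# xlite C++ ModelConfig binding actually exposes.
def set_optional_mrope_fields(xlite_model_config, hf_config):
    for name in ("deepstack_num_level", "mrope_section", "mrope_interleaved"):
        value = getattr(hf_config, name, None)
        # Skip attributes the C++ binding does not define instead of raising.
        if value is not None and hasattr(xlite_model_config, name):
            setattr(xlite_model_config, name, value)
```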

Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
…le3 refactor

Upstream vLLM commit 8b34630 (Consolidate SupportsEagle #36063) renamed
get_eagle3_aux_hidden_state_layers() to get_eagle3_default_aux_hidden_state_layers()
and added a supports_eagle3() guard before calling it.
Update model_runner_v1.py to match upstream: add supports_eagle3 check and
use the new method name to fix AttributeError on Qwen3ForCausalLM.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: leo-pony <nengjunma@outlook.com>
Upstream vLLM commit cfaf466 (Support multiple KV groups in OffloadingSpec
#36610) removed self.offloaded_block_size and changed self.gpu_block_size
from a scalar to a tuple of per-group block sizes, adding block_size_factor.
Update NPUOffloadingSpec.get_manager() and get_handlers() to match the new
API: extract gpu_block_size[0] and compute offloaded_block_size via
gpu_block_size * block_size_factor.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
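
The block-size arithmetic described above can be illustrated with a minimal sketch; the standalone function below is an assumption for illustration, not the actual NPUOffloadingSpec code, which pulls these values from the vLLM offloading config.

```python
# Hedged sketch, not the actual kv_offload/npu.py code: gpu_block_size is now
# a tuple of per-KV-group block sizes, and the offloaded block size is derived
# from block_size_factor rather than stored directly.
def resolve_block_sizes(gpu_block_size: tuple[int, ...], block_size_factor: int):
    # Take the first group's GPU block size, matching the single-group
    # handling described in the commit message.
    first_group_size = gpu_block_size[0]
    offloaded_block_size = first_group_size * block_size_factor
    return first_group_size, offloaded_block_size


# Illustrative usage: one KV group with block size 128 and a factor of 4.
gpu_size, offloaded_size = resolve_block_sizes((128,), 4)  # -> 128, 512
```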

Signed-off-by: leo-pony <nengjunma@outlook.com>
The sparse_head_dim tuple (kv_lora_rank, qk_rope_head_dim, index_head_dim)
was dropped during rebase but is required by get_kv_cache_spec() when
use_sparse is True (DSv3.1 sparse MLA models).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
…0 handle

Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on upgrading the vllm-ascend project to align with a newer version of vLLM, specifically v0.17.0. The changes involve adapting various components to maintain compatibility with the updated vLLM codebase, including conditional logic for compilation, KV offloading, and attention mechanisms. Additionally, it addresses NPU-specific memory management by patching PyTorch APIs and updates internal documentation to reflect the new version alignment.

Highlights

  • vLLM Version Compatibility: Introduced version-dependent logic in ascend_config.py, kv_offload/npu.py, and worker/model_runner_v1.py to handle differences between vLLM versions, specifically targeting v0.17.0 for compilation ranges, KV offloading, and attention backend initialization.
  • NPU Memory API Compatibility: Implemented a monkey-patch in platform.py to redirect torch.accelerator memory APIs to their torch.npu equivalents, ensuring proper memory statistics and management for NPU devices (a sketch of this patch follows the list).
  • Documentation and Metadata Updates: Updated the versioning_policy.md and worker/v2/README.md files to reflect the latest supported vLLM main branch commit hash.
  • Test Temporarily Skipped: A specific end-to-end test (test_disaggregated_encoder.py) was temporarily skipped due to an identified issue with EPLB output differences.
  • Robustness Improvements: Added hasattr checks in xlite.py before attempting to set deepstack_num_level, mrope_section, and mrope_interleaved attributes, improving code robustness.
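
Below is a hedged sketch of the monkey-patch mentioned in the memory-compatibility highlight above; the exact set of functions patched in platform.py is an assumption, and a real implementation may cover more of the torch.accelerator surface.

```python
# Hedged sketch, not the actual platform.py patch: point the torch.accelerator
# memory queries at their torch.npu equivalents so memory accounting works on
# NPU devices.
import torch
import torch_npu  # noqa: F401  # registers the "npu" device and torch.npu APIs


def patch_accelerator_memory_apis() -> None:
    torch.accelerator.memory_stats = torch.npu.memory_stats
    torch.accelerator.memory_reserved = torch.npu.memory_reserved
    torch.accelerator.memory_allocated = torch.npu.memory_allocated
```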


Ignored Files
  • Ignored by pattern: .github/workflows/** (5)
    • .github/workflows/bot_pr_create.yaml
    • .github/workflows/dockerfiles/Dockerfile.lint
    • .github/workflows/pr_test_full.yaml
    • .github/workflows/pr_test_light.yaml
    • .github/workflows/schedule_codecov_refresh.yaml


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request upgrades vLLM compatibility by introducing version checks and conditional logic to handle API differences. The changes are mostly correct, but I've identified a critical issue where a safety check was removed, potentially causing an AttributeError. I've also pointed out several instances of code duplication that could be refactored to improve maintainability.

Additionally, the pull request title and description do not follow the repository's style guide. I suggest updating them to improve clarity and consistency.

Suggested PR Title:

[main][Misc][Upgrade] Upgrade vLLM compatibility

Suggested PR Summary:

### What this PR does / why we need it?
This PR updates the codebase to be compatible with a newer version of vLLM (commit `8a680463fab3bc9e6760417cd5c0a6aa58283065`). The changes primarily involve:
- Adding version checks and conditional logic to handle API differences in `ascend_config.py`, `kv_offload/npu.py`, and `worker/model_runner_v1.py`.
- Monkey-patching `torch.accelerator` in `platform.py` for NPU compatibility.
- Updating documentation and commit hashes.
- Temporarily skipping a failing test in `test_disaggregated_encoder.py`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI will be used to test the changes.

@github-actions github-actions Bot added the documentation (Improvements or additions to documentation), ci/build, module:tests, and module:core labels on Mar 18, 2026
@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@leo-pony leo-pony added the ready (read for review) and ready-for-test (start test by label for PR) labels on Mar 18, 2026
Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
@wangxiyuan wangxiyuan merged commit 8b79d4d into vllm-project:main Mar 18, 2026
38 checks passed
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Mar 25, 2026
lihaokun-2026 pushed a commit to lihaokun-2026/vllm-ascend that referenced this pull request Mar 29, 2026
chenchuw886 pushed a commit to chenchuw886/vllm-ascend that referenced this pull request Apr 1, 2026

Labels

ci/build, documentation (Improvements or additions to documentation), module:core, module:tests, ready (read for review), ready-for-test (start test by label for PR)
