Main2main upgrade to vllm 0317 afternoon #7409
Conversation
Signed-off-by: leo-pony <nengjunma@outlook.com>
Root causes:
- CompilationConfig.compile_ranges_split_points renamed to compile_ranges_endpoints (4b87ffb)
- torch.accelerator.memory_stats/reserved not supported on NPU (747b068)
- get_attn_backend() removed the block_size parameter (77a7345)

Upstream commit range: 4034c3d..43a73f8

Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-Authored-By: Claude Code <noreply@anthropic.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
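As context for the second root cause (torch.accelerator memory queries unavailable on NPU), here is a minimal sketch of the kind of redirect a platform plugin could apply. It is an assumption, not the actual patch in this PR: the function name is illustrative, and whether torch.npu exposes exact drop-in equivalents is assumed rather than verified here.

```python
# Hedged sketch: point the torch.accelerator memory queries that newer vLLM
# calls at their torch.npu counterparts, assuming torch_npu has already
# registered the NPU backend so torch.npu exists.
import torch


def patch_accelerator_memory_apis() -> None:
    if not hasattr(torch, "accelerator") or not hasattr(torch, "npu"):
        return  # nothing to patch, or the NPU backend is not loaded
    torch.accelerator.memory_stats = torch.npu.memory_stats
    torch.accelerator.memory_reserved = torch.npu.memory_reserved
```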
Signed-off-by: leo-pony <nengjunma@outlook.com>
- Restore use_sparse_c8_indexer initialization in NPUModelRunner that was dropped during rebase
- Guard deepstack_num_level, mrope_section, mrope_interleaved with hasattr checks, since the xlite C++ ModelConfig may not have these attrs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
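A minimal sketch of the guarded-attribute pattern this commit describes; `model_config` and the fallback defaults are illustrative assumptions, not the runner's exact code.

```python
# Hedged sketch: read the optional fields defensively, since the xlite C++
# ModelConfig may not define them at all.
def read_optional_rope_fields(model_config):
    deepstack_num_level = getattr(model_config, "deepstack_num_level", None)
    mrope_section = getattr(model_config, "mrope_section", None)
    mrope_interleaved = getattr(model_config, "mrope_interleaved", False)
    return deepstack_num_level, mrope_section, mrope_interleaved
```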
Signed-off-by: leo-pony <nengjunma@outlook.com>
…le3 refactor

Upstream vLLM commit 8b34630 (Consolidate SupportsEagle #36063) renamed get_eagle3_aux_hidden_state_layers() to get_eagle3_default_aux_hidden_state_layers() and added a supports_eagle3() guard before calling it. Update model_runner_v1.py to match upstream: add the supports_eagle3 check and use the new method name to fix the AttributeError on Qwen3ForCausalLM.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
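A minimal sketch of the upstream-aligned call site described above; the import path for `supports_eagle3` is an assumption based on where vLLM keeps similar interface helpers, and `model` stands in for the loaded model instance.

```python
from vllm.model_executor.models.interfaces import supports_eagle3  # path assumed


def eagle3_aux_hidden_state_layers(model):
    # Only Eagle3-capable models expose the renamed helper; calling it
    # unconditionally raised AttributeError on Qwen3ForCausalLM.
    if supports_eagle3(model):
        return model.get_eagle3_default_aux_hidden_state_layers()
    return ()
```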
Upstream vLLM commit cfaf466 (Support multiple KV groups in OffloadingSpec #36610) removed self.offloaded_block_size and changed self.gpu_block_size from a scalar to a tuple of per-group block sizes, adding block_size_factor. Update NPUOffloadingSpec.get_manager() and get_handlers() to match the new API: extract gpu_block_size[0] and compute offloaded_block_size via gpu_block_size * block_size_factor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
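A minimal sketch of the block-size arithmetic this commit adapts to; `spec` stands in for an NPUOffloadingSpec instance and the attribute names are taken from the commit message, so treat this as illustrative rather than the actual method bodies.

```python
def derive_block_sizes(spec):
    # gpu_block_size is now a tuple of per-KV-group block sizes; take the
    # first group and derive the offloaded block size via block_size_factor.
    gpu_block_size = spec.gpu_block_size[0]
    offloaded_block_size = gpu_block_size * spec.block_size_factor
    return gpu_block_size, offloaded_block_size
```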
The sparse_head_dim tuple (kv_lora_rank, qk_rope_head_dim, index_head_dim) was dropped during rebase but is required by get_kv_cache_spec() when use_sparse is True (DSv3.1 sparse MLA models).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
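A minimal sketch of the restored tuple, assuming the three dimensions come from the model's MLA config; the accessor shape is illustrative.

```python
def sparse_head_dim(config, use_sparse: bool):
    # get_kv_cache_spec() needs this triple when sparse MLA (DSv3.1) is active.
    if not use_sparse:
        return None
    return (config.kv_lora_rank, config.qk_rope_head_dim, config.index_head_dim)
```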
Signed-off-by: leo-pony <nengjunma@outlook.com>
…0 handle

Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Summary of Changes: Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly.
Code Review
This pull request upgrades vLLM compatibility by introducing version checks and conditional logic to handle API differences. The changes are mostly correct, but I've identified a critical issue where a safety check was removed, potentially causing an AttributeError. I've also pointed out several instances of code duplication that could be refactored to improve maintainability.
Additionally, the pull request title and description do not follow the repository's style guide. I suggest updating them to improve clarity and consistency.
Suggested PR Title:
[main][Misc][Upgrade] Upgrade vLLM compatibility
Suggested PR Summary:
### What this PR does / why we need it?
This PR updates the codebase to be compatible with a newer version of vLLM (commit `8a680463fab3bc9e6760417cd5c0a6aa58283065`). The changes primarily involve:
- Adding version checks and conditional logic to handle API differences in `ascend_config.py`, `kv_offload/npu.py`, and `worker/model_runner_v1.py`.
- Monkey-patching `torch.accelerator` in `platform.py` for NPU compatibility.
- Updating documentation and commit hashes.
- Temporarily skipping a failing test in `test_disaggregated_encoder.py`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI will be used to test the changes.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
1. Fix `TypeError` from `get_attn_backend()` after an argument was removed: [Refactor `check_and_update_config`](vllm-project/vllm#35122)
2. Adapt to [Rename `compile_ranges_split_points` to `compile_ranges_endpoints`](vllm-project/vllm#36027) (a minimal sketch follows this description)
3. Fix "RuntimeError: device_allocator not a DeviceAllocator": [Replace memory-related torch.cuda APIs](vllm-project/vllm#37031)
4. Adapt to [Support multiple KV groups in OffloadingSpec](vllm-project/vllm#36610), which removed `self.offloaded_block_size` and changed `self.gpu_block_size` from a scalar to a tuple of per-group block sizes, adding `block_size_factor`.
5. Adapt to [Consolidate SupportsEagle](vllm-project/vllm#36063), which renamed `get_eagle3_aux_hidden_state_layers()` to `get_eagle3_default_aux_hidden_state_layers()` and added a `supports_eagle3()` guard before calling it.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
E2E

- vLLM version: v0.17.0
- vLLM main: vllm-project/vllm@8a68046

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
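As a companion to item 2 above (the compile-range field rename), a minimal sketch of a version-tolerant accessor; the helper name is hypothetical, and only the two attribute names quoted in the description are assumed.

```python
def compile_range_points(compilation_config):
    # Newer vLLM exposes compile_ranges_endpoints; older builds used
    # compile_ranges_split_points. Prefer the new name, fall back to the old.
    if hasattr(compilation_config, "compile_ranges_endpoints"):
        return compilation_config.compile_ranges_endpoints
    return getattr(compilation_config, "compile_ranges_split_points", None)
```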
What this PR does / why we need it?
1. Fix `TypeError` from `get_attn_backend()` after an argument was removed: Refactor `check_and_update_config` (see the sketch after this list)
2. Adapt to Rename `compile_ranges_split_points` to `compile_ranges_endpoints`
3. Fix "RuntimeError: device_allocator not a DeviceAllocator": Replace memory-related torch.cuda APIs
4. Adapt to Support multiple KV groups in OffloadingSpec, which removed `self.offloaded_block_size` and changed `self.gpu_block_size` from a scalar to a tuple of per-group block sizes, adding `block_size_factor`.
5. Adapt to Consolidate SupportsEagle, which renamed `get_eagle3_aux_hidden_state_layers()` to `get_eagle3_default_aux_hidden_state_layers()` and added a `supports_eagle3()` guard before calling it.
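As referenced in item 1, a minimal sketch of a signature-tolerant way to build the call arguments; `get_attn_backend` is assumed importable from `vllm.attention`, and the caller is assumed to supply the remaining required keywords.

```python
import inspect

from vllm.attention import get_attn_backend  # import path assumed


def attn_backend_kwargs(block_size, **kwargs):
    # Newer vLLM dropped block_size from get_attn_backend(); only forward it
    # when the installed version still accepts the parameter.
    if "block_size" in inspect.signature(get_attn_backend).parameters:
        kwargs["block_size"] = block_size
    return kwargs
```

The caller would then spread the returned kwargs into the real `get_attn_backend()` call alongside its other arguments, so the same call site works on both sides of the upstream change.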
Does this PR introduce any user-facing change?
NA
How was this patch tested?
E2E