Conversation
**Summary of Changes**

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request primarily focuses on upgrading the vLLM integration to align with version 0.3.5, as indicated by the updated versioning policy. It introduces crucial compatibility patches for differences between vLLM versions and for missing PyTorch operators on the Ascend platform.
Code Review
This pull request upgrades the vLLM dependency and introduces several compatibility changes: updating documentation, fixing tests, and adding compatibility shims for different vLLM versions. I've found an opportunity to improve maintainability by refactoring duplicated code in `vllm_ascend/worker/model_runner_v1.py`.
As per the repository's style guide, here are suggestions for the pull request title and summary:
Suggested PR Title:
[Ops][Misc] Upgrade vLLM dependency

Suggested PR Summary:
### What this PR does / why we need it?
This pull request upgrades the vLLM dependency to a newer version (from commit `e2b3124...`) and introduces several changes to ensure compatibility with the updated upstream code.
Key changes include:
- Updating the vLLM commit hash in the versioning policy documentation.
- Modifying test setups to align with changes in the vLLM testing framework.
- Removing deprecated code, such as `bc_linter_include`.
- Adding a compatibility patch (`patch_fusion_matcher_compat_ops.py`) to handle missing PyTorch operators on the Ascend platform, preventing import-time errors (a hedged sketch of the general mechanism appears below).
- Introducing version-specific logic in `NPUModelRunner` to handle API differences in `cudagraph_dispatcher` between vLLM versions.
These changes are necessary to keep `vllm-ascend` in sync with the latest developments in the core vLLM repository.
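
For illustration, a compat patch of this kind typically checks whether an operator resolves before vLLM's fusion matcher traces it at import time, and registers a stub otherwise. The following is a minimal sketch of that general mechanism; the `ascend_compat` namespace and `noop` operator are hypothetical and are not the actual contents of `patch_fusion_matcher_compat_ops.py`:

```python
import torch


def op_exists(namespace: str, name: str) -> bool:
    """Return True if torch.ops.<namespace>.<name> resolves on this build."""
    try:
        getattr(getattr(torch.ops, namespace), name)
        return True
    except (AttributeError, RuntimeError):
        return False


# Hypothetical namespace/op, purely for illustration.
if not op_exists("ascend_compat", "noop"):
    # Create the namespace and register a pass-through stub. This is enough
    # for import-time pattern registration; the stub is never meant to run
    # as a real fused kernel on Ascend.
    _lib = torch.library.Library("ascend_compat", "DEF")  # kept alive at module scope
    _lib.define("noop(Tensor x) -> Tensor")
    _lib.impl("noop", lambda x: x.clone(), "CompositeExplicitAutograd")
```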
### Does this PR introduce _any_ user-facing change?
No, this PR primarily consists of internal dependency upgrades and compatibility fixes. There are no user-facing API or behavior changes.
### How was this patch tested?
CI should pass. The changes include updates to unit tests to ensure they pass with the new vLLM version.

The review comment below refers to this hunk in `vllm_ascend/worker/model_runner_v1.py`:

```python
if vllm_version_is("0.16.0"):

    def dispatch_cudagraph(num_tokens, disable_full=False, valid_modes=None):
        if force_eager:
            return (CUDAGraphMode.NONE, BatchDescriptor(num_tokens_padded))
        return self.cudagraph_dispatcher.dispatch(
            num_tokens=num_tokens,
            has_lora=has_lora,
            uniform_decode=uniform_decode,
            disable_full=disable_full,
        )
else:

    def dispatch_cudagraph(num_tokens, disable_full=False, valid_modes=None):
        if force_eager:
            return (CUDAGraphMode.NONE, BatchDescriptor(num_tokens_padded))
        return self.cudagraph_dispatcher.dispatch(
            num_tokens=num_tokens,
            has_lora=has_lora,
            uniform_decode=uniform_decode,
            valid_modes=valid_modes,
            invalid_modes={CUDAGraphMode.FULL} if disable_full else None,
        )
```
There is significant code duplication in the `dispatch_cudagraph` function definition for the two vLLM version branches. This makes the code harder to maintain and increases the risk of introducing bugs if one branch is modified and the other is not. This can be refactored to define the function once and handle the version-specific logic inside.
```python
def dispatch_cudagraph(num_tokens, disable_full=False, valid_modes=None):
    if force_eager:
        return (CUDAGraphMode.NONE, BatchDescriptor(num_tokens_padded))
    common_args = {
        "num_tokens": num_tokens,
        "has_lora": has_lora,
        "uniform_decode": uniform_decode,
    }
    if vllm_version_is("0.16.0"):
        return self.cudagraph_dispatcher.dispatch(
            **common_args,
            disable_full=disable_full,
        )
    else:
        return self.cudagraph_dispatcher.dispatch(
            **common_args,
            valid_modes=valid_modes,
            invalid_modes={CUDAGraphMode.FULL} if disable_full else None,
        )
```
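
Collecting the shared keyword arguments in a `common_args` dict and splatting them with `**` confines each version branch to exactly the parameters that differ between vLLM releases, so a future change to the shared arguments (or to the eager-mode early return) only has to be made in one place.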
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
- If CI fails, you can run linting and testing checks locally, as described in the Contributing and Testing guides.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: MrZ20 <2609716663@qq.com>
May I know why and how `empty_cache` breaks on the Ascend side?
Thanks for your attention! We found that `torch.accelerator.empty_cache()` fails on Ascend because the NPU caching allocator is not registered as a `DeviceAllocator`. We may need to check whether it's ready in torch_npu v2.10.0 and upgrade the torch version. But before that, our main-to-main CI will keep breaking.

```python
>>> import torch
>>> import torch_npu
/root/vllm-workspace2/.venv/lib/python3.11/site-packages/torch_npu/__init__.py:309: UserWarning: On the interactive interface, the value of TASK_QUEUE_ENABLE is set to 0 by default. Do not set it to 1 to prevent some unknown errors
  warnings.warn("On the interactive interface, the value of TASK_QUEUE_ENABLE is set to 0 by default. \
>>> torch.accelerator.empty_cache()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/vllm-workspace2/.venv/lib/python3.11/site-packages/torch/accelerator/memory.py", line 28, in empty_cache
    if not torch._C._accelerator_isAllocatorInitialized():
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: device_allocator INTERNAL ASSERT FAILED at "/pytorch/c10/core/CachingDeviceAllocator.h":109, please report a bug to PyTorch. Allocator for npu is not a DeviceAllocator.
```
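
Until torch_npu registers its allocator with the accelerator framework, a guarded fallback is one way to keep a single code path working across backends. This is a minimal sketch, assuming `torch.npu.empty_cache()` is available whenever torch_npu has been imported; it is not necessarily the approach taken in this PR:

```python
import torch


def empty_cache_compat() -> None:
    """Best-effort cache release that tolerates backends whose
    allocator is not yet wired into torch.accelerator."""
    try:
        # Device-agnostic API; currently raises RuntimeError on Ascend
        # because the NPU allocator is not a DeviceAllocator.
        torch.accelerator.empty_cache()
    except (RuntimeError, AttributeError):
        if hasattr(torch, "npu"):
            # Backend-specific fallback provided by torch_npu,
            # analogous to torch.cuda.empty_cache().
            torch.npu.empty_cache()
        elif torch.cuda.is_available():
            torch.cuda.empty_cache()
```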
Sorry to hear that. I thought `empty_cache` should work on torch 2.9...
This pull request has conflicts, please resolve those before we can evaluate the pull request.
### What this PR does / why we need it?

break: replacing `torch.cuda.empty_cache` with `torch.accelerator.empty_cache` (vllm#30681)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?