[300I][Bugfix] fix unquant model weight nd2nz error#6851
wangxiyuan merged 16 commits into vllm-project:main
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical bug affecting unquantized model weights on Ascend 310P devices, ensuring they are correctly formatted for optimal performance.
Activity
Code Review
This pull request refactors the weight conversion logic for unquantized models on Ascend 310P devices. The platform-specific logic for converting weights to the FRACTAL_NZ format is moved from a dedicated 310P linear layer implementation into the generic maybe_trans_nz utility function. This simplifies the codebase by removing duplicated logic and fixes an issue with weight conversion. The change is a good improvement.
I have one suggestion to improve code maintainability by updating an outdated comment in vllm_ascend/utils.py.
As per the repository's style guide, I'm also suggesting an updated title and summary for this pull request:
Suggested PR Title:
[310P][Ops][BugFix] Fix unquantized model weight NZ conversion

Suggested PR Summary:
### What this PR does / why we need it?
This PR fixes an issue with weight format conversion for unquantized models running on Ascend 310P devices.
The changes refactor the logic for converting weights to the FRACTAL_NZ format. Previously, this was handled in a 310P-specific linear layer implementation (`AscendUnquantizedLinearMethod310`). This implementation has been removed, and the logic is now centralized in the `maybe_trans_nz` utility function. This function now checks if the device is a 310P and applies the NZ format cast accordingly for `float16`/`bfloat16` weights.
This refactoring simplifies the code by removing platform-specific duplication and ensures correct weight handling for unquantized models on 310P.
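The centralized dispatch described above can be sketched as follows. This is a hypothetical, simplified illustration, not the actual vllm-ascend implementation: the `is_310p` and `format_cast` stubs stand in for the real SoC-version check and `torch_npu.npu_format_cast` call, and the FRACTAL_NZ enum value is an assumption, so the control flow can run without NPU hardware.

```python
# Hypothetical sketch of the centralized NZ-conversion logic.
ACL_FORMAT_FRACTAL_NZ = 29  # Ascend format enum value (assumption)


def is_310p() -> bool:
    # Stub: the real check queries the Ascend SoC version.
    return True


def format_cast(weight, fmt):
    # Stub standing in for torch_npu.npu_format_cast(weight, fmt);
    # returns a tagged pair so the result can be inspected.
    return (weight, fmt)


def maybe_trans_nz(weight, dtype: str):
    """Cast float16/bfloat16 weights to FRACTAL_NZ on 310P; otherwise no-op."""
    if is_310p() and dtype in ("float16", "bfloat16"):
        return format_cast(weight, ACL_FORMAT_FRACTAL_NZ)
    return weight
```

With this shape, every linear layer can call one utility unconditionally and the platform check lives in a single place, which is the simplification the PR aims for.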
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed. The change is a refactoring that preserves existing behavior for 310P devices while improving code structure.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed 436e697 to 4d90ae0
```python
class _AscendParallelLMHead310QuantMethod(QuantizeMethodBase):

    def __init__(self, base_method: QuantizeMethodBase):
        self._base_method = base_method

    def __getattr__(self, name: str):
        return getattr(self._base_method, name)
```
This unconventional, inheritance-like class carries the following risks:
- `isinstance(obj, UnquantizedEmbeddingMethod)` returns False, laying a trap for future framework optimizations.
- If the `UnquantizedEmbeddingMethod` base class later adds parameters to methods such as `create_weights` or `apply`, this wrapper will fall out of sync.
- If another class ever needs to inherit from `AscendParallelLMHead310QuantMethod`, `super().__init__()` will not behave as expected.
- `__getattr__` fires on every delegated attribute access, adding performance overhead compared with true inheritance.
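The `isinstance` pitfall the reviewer raises can be shown with a minimal, self-contained sketch (class names mirror the PR only loosely; this is an illustration, not the project's code):

```python
class QuantizeMethodBase:
    pass


class UnquantizedEmbeddingMethod(QuantizeMethodBase):
    def create_weights(self):
        return "weights"


class DelegatingMethod(QuantizeMethodBase):
    """Wraps a base method via __getattr__ instead of inheriting from it."""

    def __init__(self, base_method):
        self._base_method = base_method

    def __getattr__(self, name):
        # Fires only for attributes NOT found through normal lookup;
        # forwards them to the wrapped object on every access.
        return getattr(self._base_method, name)


wrapped = DelegatingMethod(UnquantizedEmbeddingMethod())
wrapped.create_weights()                      # forwarding works
isinstance(wrapped, UnquantizedEmbeddingMethod)  # but identity is lost: False
```

Any framework code that type-checks against `UnquantizedEmbeddingMethod` will silently skip the wrapper, which is why plain inheritance is the safer design here.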
```python
from vllm_ascend.utils import maybe_trans_nz


class _AscendParallelLMHead310QuantMethod(QuantizeMethodBase):
```
Prefixing the class name with an underscore `_` is inappropriate here; it should be `AscendUnquantizedEmbeddingMethod310`, consistent with the naming style of the other classes.
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
Force-pushed 28e014e to 2f50b65
…to qwen3next_graph

* 'main' of https://github.com/vllm-project/vllm-ascend: (40 commits)
  - [Feature] Add docs of batch invariance and make some extra operators patch (vllm-project#6910)
  - [bugfix] Qwen2.5VL accurate question (vllm-project#6975)
  - [CI] Add DeepSeek-V3.2 large EP nightly ci (vllm-project#6378)
  - [Ops][BugFix] Fix RoPE shape mismatch for mtp models with flashcomm v1 enabled (vllm-project#6939)
  - [bugfix] fix file not found error in nightly of single-node (vllm-project#6976)
  - [Bugfix] Fix the acceptance rates drop issue when applying eagle3 to QuaRot model (vllm-project#6914)
  - [CI] Enable auto upgrade e2e estimated time for auto-partition suites (vllm-project#6840)
  - [Doc][Misc] Fix msprobe_guide.md documentation issues (vllm-project#6965)
  - [Nightly][Refactor] Migrate nightly single-node model tests from `.py` to `.yaml` (vllm-project#6503)
  - [BugFix] Improve GDN layer detection for multimodal models (vllm-project#6941)
  - [feat] ds3.2 pcp support mtp and chunkprefill (vllm-project#6917)
  - [CPU binding] Implement global CPU slicing and improve IRQ binding for Ascend NPUs (vllm-project#6945)
  - [Triton] Centralize Ascend extension op dispatch in triton_utils (vllm-project#6937)
  - [csrc][bugfix] Add compile-time Ascend950/910_95 compatibility for custom ops between CANN8.5 and 9.0 (vllm-project#6936)
  - [300I][Bugfix] fix unquant model weight nd2nz error (vllm-project#6851)
  - [doc] fix supported_models (vllm-project#6930)
  - [CI] nightly test timeout (vllm-project#6912)
  - [CI] Upgrade CANN to 8.5.1 (vllm-project#6897)
  - [Model] Add Qwen3-Omni quantization Ascend NPU adaptation and optimization (vllm-project#6828)
  - [P/D][v0.16.0] Adapt to RecomputeScheduler in vLLM 0.16.0 (vllm-project#6898)
  - ...
### What this PR does / why we need it?

- This PR fixes an issue with weight format conversion for unquantized models running on Ascend 310P devices.
- The changes refactor the logic for converting weights to the FRACTAL_NZ format. Previously, this was handled in a 310P-specific linear layer implementation (`AscendUnquantizedLinearMethod310`). This implementation has been removed, and the logic is now centralized in the `maybe_trans_nz` utility function. This function now checks if the device is a 310P and applies the NZ format cast accordingly for `float16`/`bfloat16` weights.
- This refactoring simplifies the code by removing platform-specific duplication and ensures correct weight handling for unquantized models on 310P.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

ut and local test

- vLLM version: v0.15.0
- vLLM main: vllm-project/vllm@83b47f6

Signed-off-by: Tflowers-0129 <2906339855@qq.com>