[300I][Bugfix] fix unquant model weight nd2nz error#6851

Merged
wangxiyuan merged 16 commits into vllm-project:main from Tflowers-0129:fixnd2nznew on Mar 3, 2026
Conversation

@Tflowers-0129 (Contributor) commented Feb 27, 2026

What this PR does / why we need it?

  • This PR fixes an issue with weight format conversion for unquantized models running on Ascend 310P devices.

  • The changes refactor the logic for converting weights to the FRACTAL_NZ format. Previously, this was handled in a 310P-specific linear layer implementation (AscendUnquantizedLinearMethod310). This implementation has been removed, and the logic is now centralized in the maybe_trans_nz utility function. This function now checks if the device is a 310P and applies the NZ format cast accordingly for float16/bfloat16 weights.

  • This refactoring simplifies the code by removing platform-specific duplication and ensures correct weight handling for unquantized models on 310P.
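
The centralized check described above can be sketched in a few lines. This is a hypothetical illustration, not the PR's actual code: `FakeWeight`, `is_310p`, and `npu_format_cast` are stand-ins for the real torch tensor and torch_npu/vllm_ascend APIs, and the format-constant value is an assumption.

```python
from dataclasses import dataclass

ACL_FORMAT_FRACTAL_NZ = 29  # assumed CANN format id for the NZ layout

@dataclass
class FakeWeight:
    """Minimal stand-in for a torch tensor: just a dtype plus a layout tag."""
    dtype: str
    acl_format: object = "ND"  # weights start out in the default ND format

def is_310p() -> bool:
    """Stand-in for the real SoC-version check in vllm_ascend.utils."""
    return True  # pretend we are on a 310P device for this sketch

def npu_format_cast(w: FakeWeight, fmt) -> FakeWeight:
    """Stand-in for torch_npu.npu_format_cast; here we just retag the weight."""
    w.acl_format = fmt
    return w

def maybe_trans_nz(weight: FakeWeight) -> FakeWeight:
    # On 310P, float16/bfloat16 weights are cast to FRACTAL_NZ; any other
    # dtype (or any other SoC) keeps the default ND layout.
    if is_310p() and weight.dtype in ("float16", "bfloat16"):
        return npu_format_cast(weight, ACL_FORMAT_FRACTAL_NZ)
    return weight

print(maybe_trans_nz(FakeWeight("float16")).acl_format)  # 29
print(maybe_trans_nz(FakeWeight("float32")).acl_format)  # ND
```

Centralizing the dtype/SoC branch in one utility is what lets the 310P-specific linear method be deleted: every caller that loads a weight can route it through the same function.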

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests and local testing.

@gemini-code-assist commented

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug affecting unquantized model weights on Ascend 310P devices, ensuring they are correctly formatted for optimal performance and preventing nd2nz errors. Concurrently, it includes a refactoring effort to generalize the AscendUnquantizedLinearMethod by relocating it to a broader operational module, streamlining the codebase.

Highlights

  • Bug Fix: Unquantized Model Weight Formatting: Resolved an issue where unquantized bf16/fp16 model weights on Ascend 310P devices were not consistently cast to the ACL_FORMAT_FRACTAL_NZ format, which could lead to nd2nz errors during model execution.
  • Code Refactoring: Refactored the AscendUnquantizedLinearMethod by moving its definition from a 310P-specific file to a more general vllm_ascend.ops.linear module, improving code organization and reusability.


Changelog
  • vllm_ascend/_310p/ops/linear.py
    • Removed the file containing 310P-specific linear operation definitions, as its functionality was generalized.
  • vllm_ascend/_310p/quantization/modelslim_config.py
    • Updated the import statement for AscendUnquantizedLinearMethod to reflect its new, generalized location.
    • Modified the instantiation of AscendUnquantizedLinearMethod310 to AscendUnquantizedLinearMethod.
  • vllm_ascend/utils.py
    • Extended the condition in maybe_trans_nz to ensure that bf16/fp16 weights are always cast to ACL_FORMAT_FRACTAL_NZ when running on 310P devices, in addition to the existing environment variable check.

@gemini-code-assist bot left a comment

Code Review

This pull request refactors the weight conversion logic for unquantized models on Ascend 310P devices. The platform-specific logic for converting weights to the FRACTAL_NZ format is moved from a dedicated 310P linear layer implementation into the generic maybe_trans_nz utility function. This simplifies the codebase by removing duplicated logic and fixes an issue with weight conversion. The change is a good improvement.

I have one suggestion to improve code maintainability by updating an outdated comment in vllm_ascend/utils.py.

As per the repository's style guide, I'm also suggesting an updated title and summary for this pull request:

Suggested PR Title:

[310P][Ops][BugFix] Fix unquantized model weight NZ conversion

Suggested PR Summary:

### What this PR does / why we need it?
This PR fixes an issue with weight format conversion for unquantized models running on Ascend 310P devices.

The changes refactor the logic for converting weights to the FRACTAL_NZ format. Previously, this was handled in a 310P-specific linear layer implementation (`AscendUnquantizedLinearMethod310`). This implementation has been removed, and the logic is now centralized in the `maybe_trans_nz` utility function. This function now checks if the device is a 310P and applies the NZ format cast accordingly for `float16`/`bfloat16` weights.

This refactoring simplifies the code by removing platform-specific duplication and ensures correct weight handling for unquantized models on 310P.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed. The change is a refactoring that preserves existing behavior for 310P devices while improving code structure.

Comment thread: vllm_ascend/utils.py (outdated)
@github-actions commented

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@github-actions commented

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Comment thread: vllm_ascend/_310p/ops/vocab_parallel_embedding.py (outdated)
Comment thread: vllm_ascend/utils.py
Comment on lines +26 to +31

class _AscendParallelLMHead310QuantMethod(QuantizeMethodBase):
    def __init__(self, base_method: QuantizeMethodBase):
        self._base_method = base_method

    def __getattr__(self, name: str):
        return getattr(self._base_method, name)

This unconventional inheritance-by-delegation pattern carries the following risks:

  1. isinstance(obj, UnquantizedEmbeddingMethod) will return False, planting a trap for future framework optimizations.
  2. If the base class's create_weights, apply, or other methods gain new parameters later, this wrapper will fail to adapt.
  3. If another class ever needs to inherit from AscendParallelLMHead310QuantMethod, super().__init__() will not behave as expected.
  4. __getattr__ is triggered on every attribute access, adding performance overhead compared with real inheritance.
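
The isinstance pitfall noted in the review above can be demonstrated in a few lines. This is a self-contained sketch: the class names mirror the reviewed snippet, but the bodies are illustrative stand-ins, not the actual vLLM code.

```python
class QuantizeMethodBase:
    """Stand-in for vLLM's QuantizeMethodBase."""

class UnquantizedEmbeddingMethod(QuantizeMethodBase):
    """Stand-in for the base method being wrapped."""

class DelegatingMethod(QuantizeMethodBase):
    """Delegation wrapper in the style of the reviewed snippet."""
    def __init__(self, base_method):
        self._base_method = base_method

    def __getattr__(self, name):
        # Triggered on every attribute miss -- per-access overhead (risk 4).
        return getattr(self._base_method, name)

class SubclassMethod(UnquantizedEmbeddingMethod):
    """Real inheritance, for comparison."""

wrapped = DelegatingMethod(UnquantizedEmbeddingMethod())
print(isinstance(wrapped, UnquantizedEmbeddingMethod))           # False
print(isinstance(SubclassMethod(), UnquantizedEmbeddingMethod))  # True
```

Any framework code that dispatches on the wrapped method's concrete type will silently take the wrong branch for the delegating wrapper, which is why the review favors a real subclass here.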

from vllm_ascend.utils import maybe_trans_nz


class _AscendParallelLMHead310QuantMethod(QuantizeMethodBase):

Prefixing the class name with an underscore (_) is inappropriate here; it should be AscendUnquantizedEmbeddingMethod310, consistent with the naming style of the other classes.

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
@wangxiyuan wangxiyuan merged commit 2064afe into vllm-project:main Mar 3, 2026
25 checks passed
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Mar 5, 2026
…to qwen3next_graph

* 'main' of https://github.com/vllm-project/vllm-ascend: (40 commits)
  [Feature] Add docs of batch invariance and make some extra operators patch (vllm-project#6910)
  [bugfix]Qwen2.5VL accurate question (vllm-project#6975)
  [CI] Add DeepSeek-V3.2 large EP nightly ci (vllm-project#6378)
  [Ops][BugFix] Fix RoPE shape mismatch for mtp models with flashcomm v1 enabled (vllm-project#6939)
  [bugfix]fix file not found error in nightly of single-node (vllm-project#6976)
  [Bugfix] Fix the acceptance rates dorp issue when applying eagle3 to QuaRot model (vllm-project#6914)
  [CI] Enable auto upgrade e2e estimated time for auto-partition suites (vllm-project#6840)
  [Doc][Misc] Fix msprobe_guide.md documentation issues (vllm-project#6965)
  [Nightly][Refactor]Migrate nightly single-node model tests from `.py` to `.yaml` (vllm-project#6503)
  [BugFix] Improve GDN layer detection for multimodal models (vllm-project#6941)
  [feat]ds3.2 pcp support mtp and chunkprefill (vllm-project#6917)
  [CPU binding] Implement global CPU slicing and improve IRQ binding for Ascend NPUs (vllm-project#6945)
  [Triton] Centralize Ascend extension op dispatch in triton_utils (vllm-project#6937)
  [csrc][bugfix] Add compile-time Ascend950/910_95 compatibility for custom ops between CANN8.5 and 9.0 (vllm-project#6936)
  [300I][Bugfix] fix unquant model weight nd2nz error (vllm-project#6851)
  [doc] fix supported_models (vllm-project#6930)
  [CI] nightly test timeout (vllm-project#6912)
  [CI] Upgrade CANN to 8.5.1 (vllm-project#6897)
  [Model]Add Qwen3-Omni quantization Ascend NPU adaptation and optimization (vllm-project#6828)
  [P/D][v0.16.0]Adapt to RecomputeScheduler in vLLM 0.16.0 (vllm-project#6898)
  ...
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026
- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@83b47f6

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>