[Model] Support Minimax-m2.5 on NPU #7105

Merged
wangxiyuan merged 3 commits into vllm-project:main from SparrowMu:main on Mar 10, 2026

@SparrowMu SparrowMu commented Mar 10, 2026

What this PR does / why we need it?

Initial version to support minimax-m2.5 on vllm-ascend.
This commit converts the original FP8 weights to dequantized BF16 to support Minimax-m2.5 on NPU.

Does this PR introduce any user-facing change?

How was this patch tested?

Test Report

Self-tested precision summary; the official AIME2025 precision score is 86.3.
image

@SparrowMu SparrowMu requested a review from wangxiyuan as a code owner March 10, 2026 04:16
@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request provides the foundational support for running the Minimax-m2.5 model on NPU devices within the vllm-ascend ecosystem. It addresses compatibility challenges by implementing a mechanism to dequantize FP8 weights to BF16, ensuring proper model loading and execution. Additionally, it includes performance enhancements and configuration adjustments tailored for NPU, such as optimizing attention layers and managing environment variables for efficient graph capture.

Highlights

  • Minimax-m2.5 NPU Support: Enabled initial support for the Minimax-m2.5 model on NPU.
  • FP8 to BF16 Weight Conversion: Implemented automatic conversion of FP8 weights to BF16 during loading for Minimax-m2.5 on NPU.
  • Model Configuration Patching: Patched ModelConfig to disable FP8 quantization validation and set HCCL_OP_EXPANSION_MODE for NPU ACL graph capture.
  • MoE Distributed Processing: Modified MiniMaxM2MoE to utilize maybe_all_reduce_tensor_model_parallel for distributed processing.
  • Attention Mechanism Optimization: Updated MiniMaxM2Attention to handle num_kv_head_replicas and k_norm weight sharding for NPU.
  • Linear Attention RMSNorm Optimization: Introduced NPU-specific optimizations for MiniMaxText01RMSNormTP including kernelized RMSNorm and global rstd correction.
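
As a rough illustration of the "Model Configuration Patching" highlight, the sketch below shows the general monkey-patching pattern: wrap a config's verification method so that an FP8 setting is cleared before validation runs, and set an environment variable only when the user has not already exported one. `FakeModelConfig`, `patch_verify`, `set_expansion_mode`, and the value `"AIV"` are hypothetical stand-ins chosen for the example; none of them are taken from the PR, and the real patch may be structured differently.

```python
import os


class FakeModelConfig:
    """Hypothetical stand-in for a model config; not vLLM's real ModelConfig."""

    def __init__(self, quantization):
        self.quantization = quantization

    def verify_quantization(self):
        # Pretend the platform cannot validate fp8 checkpoints.
        if self.quantization == "fp8":
            raise ValueError("fp8 is not supported on this platform")


def patch_verify(cls):
    """Wrap verify_quantization so fp8 is cleared before validation runs."""
    original = cls.verify_quantization

    def patched(self):
        if self.quantization == "fp8":
            # Assumption: weights are dequantized to bf16 at load time,
            # so the config no longer needs to advertise fp8.
            self.quantization = None
        return original(self)

    cls.verify_quantization = patched


def set_expansion_mode(value="AIV"):
    # "AIV" is an assumed example value; the PR summary does not state it.
    # setdefault keeps any value the user already exported.
    return os.environ.setdefault("HCCL_OP_EXPANSION_MODE", value)


patch_verify(FakeModelConfig)
cfg = FakeModelConfig("fp8")
cfg.verify_quantization()  # no longer raises after patching
mode = set_expansion_mode()
```

Using `setdefault` rather than a plain assignment is a design choice for the sketch: it lets an explicit user setting win over the patch's default.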

Changelog
  • vllm_ascend/patch/platform/__init__.py
    • Imported patch_minimax_m2_config to integrate NPU-specific model configuration.
  • vllm_ascend/patch/platform/patch_minimax_m2_config.py
    • Added a new file to patch ModelConfig for Minimax-m2.5 on NPU, handling FP8 quantization and setting HCCL_OP_EXPANSION_MODE.
  • vllm_ascend/patch/worker/__init__.py
    • Imported patch_minimax_m2 and patch_minimax_m2_linear_attn to enable NPU-specific worker optimizations.
  • vllm_ascend/patch/worker/patch_minimax_m2.py
    • Added a new file to implement NPU-specific optimizations for MiniMaxM2MoE (all-reduce), MiniMaxM2Attention (weight sharding), and MiniMaxM2Model (FP8 dequantization).
  • vllm_ascend/patch/worker/patch_minimax_m2_linear_attn.py
    • Added a new file to introduce NPU-optimized MiniMaxText01RMSNormTP for linear attention, including kernelized RMSNorm and global rstd correction.
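
The summary does not spell out what "global rstd correction" means. One plausible reading, sketched here purely as an assumption, is that each tensor-parallel rank runs an RMSNorm kernel over its own shard (producing a local reciprocal standard deviation, `rstd`), and the result is then rescaled so it matches the statistics of the full hidden dimension. The sketch uses plain Python lists in place of tensors and a simple average in place of an all-reduce; `local_rms_norm` and `tp_rms_norm` are hypothetical names, and equal shard lengths are assumed.

```python
import math


def local_rms_norm(shard, eps):
    """RMSNorm over one shard only; returns (normalized values, local rstd)."""
    mean_sq = sum(v * v for v in shard) / len(shard)
    rstd = 1.0 / math.sqrt(mean_sq + eps)
    return [v * rstd for v in shard], rstd


def tp_rms_norm(shards, eps=1e-6):
    """Normalize each shard locally, then rescale to the global statistics."""
    outputs, local_rstds, local_mean_sqs = [], [], []
    for shard in shards:
        out, rstd = local_rms_norm(shard, eps)
        outputs.append(out)
        local_rstds.append(rstd)
        # Recover the shard's mean of squares from the rstd the kernel returned.
        local_mean_sqs.append(1.0 / (rstd * rstd) - eps)

    # Stand-in for an all-reduce: average the per-rank mean of squares
    # (valid here because every shard is assumed to have the same length).
    global_mean_sq = sum(local_mean_sqs) / len(local_mean_sqs)
    global_rstd = 1.0 / math.sqrt(global_mean_sq + eps)

    # Correction: swap each shard's local rstd for the global one.
    return [
        [v * (global_rstd / rstd) for v in out]
        for out, rstd in zip(outputs, local_rstds)
    ]


shards = [[1.0, 2.0], [3.0, 4.0]]
corrected = tp_rms_norm(shards)
```

Under these assumptions the corrected output agrees with an RMSNorm computed over the unsharded vector, while only one scalar per rank has to be communicated.
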
Activity
  • No human activity (comments, reviews, progress updates) has been recorded for this pull request yet.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the Minimax-m2.5 model on NPU by adding several patches. These patches handle fp8 dequantization, adjust for tensor parallelism in MoE and attention layers, and include hardware-specific optimizations. My review has identified a potential logical conflict in the quantization verification patch and a memory-inefficient operation during weight dequantization. Addressing these concerns will enhance the patch's robustness and performance.

Comment on lines +102 to +103
# Some versions may read self.quantization before calling platform verifier.
_disable_fp8(self, log=True)

high

There is a logical conflict in _patched_verify_quantization that makes the patch's behavior unclear and potentially incorrect. The call to _disable_fp8 on line 103 nullifies self.quantization before _original_verify_quantization is called. This is justified by the comment on line 102 for some vLLM versions.

However, this directly conflicts with the _platform_verify_hook mechanism. The hook will receive None as quant_method, causing the check _should_disable_fp8(self, quant_method) to always fail. This makes the core interception logic within the hook (lines 95-98) unreachable.

This implementation cannot simultaneously handle versions that require pre-emptive nullification of self.quantization and versions that rely on the hook to intercept the quantization method. Please clarify the intended logic and resolve this conflict. One approach might be to remove line 103 if the hook-based interception is the primary mechanism desired.
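
The conflict described above can be reproduced with a small model of the control flow. This is only a sketch: `Config`, `platform_verify_hook`, `original_verify`, and `patched_verify` are hypothetical simplifications of the functions named in the review, and the `pre_disable` flag exists solely to compare the two behaviours side by side.

```python
class Config:
    """Hypothetical config object holding only the attribute under discussion."""

    def __init__(self, quantization):
        self.quantization = quantization


hook_calls = []


def platform_verify_hook(config, quant_method):
    """Hook meant to intercept fp8; records what it saw and whether it fired."""
    intercepted = quant_method == "fp8"
    hook_calls.append((quant_method, intercepted))
    if intercepted:
        config.quantization = None


def original_verify(config):
    # The original verifier hands the current quantization to the hook.
    platform_verify_hook(config, config.quantization)


def patched_verify(config, pre_disable):
    if pre_disable:
        # Pre-emptive nullification, as on the reviewed line 103.
        config.quantization = None
    original_verify(config)


patched_verify(Config("fp8"), pre_disable=True)   # hook sees None, never fires
patched_verify(Config("fp8"), pre_disable=False)  # hook sees "fp8" and fires
```

In this model, once the attribute is cleared up front the hook's interception branch is dead code, which is the behaviour the review asks the author to resolve one way or the other.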

Comment on lines +128 to +131
expanded_scale = weight_scale_inv.repeat_interleave(
    block_n, dim=0
).repeat_interleave(block_k, dim=1)
expanded_scale = expanded_scale[:n, :k].to(dtype=torch.bfloat16)

high

The use of repeat_interleave to create expanded_scale can be very memory-intensive, as it creates a temporary tensor of the same size as the full weight tensor. For large models, this could potentially lead to out-of-memory errors. Consider a more memory-efficient approach, such as iterating over blocks and applying the scaling factor, to avoid allocating this large intermediate tensor.
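
A minimal sketch of the block-wise alternative the review suggests is shown below. It uses nested Python lists as a stand-in for tensors so that the indexing is explicit; `dequant_blockwise` is a hypothetical helper, not code from the PR, and a real implementation would operate on tensor slices rather than individual elements.

```python
def dequant_blockwise(weight, scale_inv, block_n, block_k):
    """Scale an n x k matrix in place, one (block_n x block_k) tile at a time.

    weight:    list of n rows, each a list of k floats (stand-in for fp8 data)
    scale_inv: one factor per tile, shape ceil(n/block_n) x ceil(k/block_k)
    """
    n, k = len(weight), len(weight[0])
    for bi, row_start in enumerate(range(0, n, block_n)):
        for bj, col_start in enumerate(range(0, k, block_k)):
            s = scale_inv[bi][bj]
            # min() handles a ragged last tile, the case the reviewed code
            # covers by slicing the expanded scale with [:n, :k].
            for i in range(row_start, min(row_start + block_n, n)):
                row = weight[i]
                for j in range(col_start, min(col_start + block_k, k)):
                    row[j] *= s
    return weight


# 3 x 3 weight with 2 x 2 tiles: the last row and column form partial tiles.
weight = [[1.0] * 3 for _ in range(3)]
scale_inv = [[2.0, 3.0], [4.0, 5.0]]
result = dequant_blockwise(weight, scale_inv, block_n=2, block_k=2)
```

Each tile reads a single scale value, so no intermediate of the full weight shape is allocated; the trade-off is more, smaller operations instead of one large multiply.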

…overting original fp8 weight to a quantilized bf16.

Signed-off-by: limuyuan <limuyuan3@huawei.com>
Signed-off-by: limuyuan <limuyuan3@huawei.com>
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: SparrowMu <52023119+SparrowMu@users.noreply.github.com>
@wangxiyuan wangxiyuan merged commit 54668e7 into vllm-project:main Mar 10, 2026
36 checks passed
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Mar 12, 2026
…to qwen3next_graph

* 'main' of https://github.com/vllm-project/vllm-ascend: (88 commits)
  [main][bugfix] Fixed the problem of speculative decoding in FULL mode (vllm-project#7148)
  fixed fia pad logic in graph mode. (vllm-project#7144)
  [Doc] fix DSV3.1 PD configs (vllm-project#7187)
  refactor: add a check before layer_sharding logging (vllm-project#7186)
  [Build] Add support for Ascend950 chip (vllm-project#7151)
  Revert "[CI] fix skiped e2e test when upgrade vllm version  (vllm-project#6654)" (vllm-project#7166)
  [MODELRUNNERV2]fix penality ops (vllm-project#7013)
  [Bugfix][LoRA] Fix the issue when enable LoRA + tp + fully_sharded_loras (vllm-project#6650)
  [KV Pool]get_num_new_matched_tokens return 0 if token length < block_size (vllm-project#7146)
  [CI] Build Image for v0.16.0rc1 (vllm-project#7155)
  [CI] Skip `test_mooncake_layerwise_connector.py` in `ut` (vllm-project#7147)
  [BugFix]Fix recomputed scheduler bug (vllm-project#7137)
  [Model] Support Minimax-m2.5 on NPU (vllm-project#7105)
  [P/D]Mooncake Layerwise Connector supports hybrid attention manager with multiple kvcache groups (vllm-project#7022)
  Add patch_qwen3_5 for triton ops fused_recurrent_gated_delta_rule (vllm-project#7109)
  [Doc][ReleaseNote] Add release notes for v0.16.0rc1 (vllm-project#7067)
  [Misc] Download on both hk and guiyang region (vllm-project#7129)
  [bugdix] The problem that the w4a8 weight fails to be loaded when the EP is not enabled is resolved. (vllm-project#7090)
  [eagle][cp] fix eagle_cp enable bug2 (vllm-project#7079)
  [CI]Upgrade niglty multi-node-tests max-parallel to 2 (vllm-project#7035)
  ...
liuchenbing2026 pushed a commit to liuchenbing2026/vllm-ascend that referenced this pull request Mar 16, 2026
Cherry-pick from upstream PR vllm-project#7105: MoE all_reduce, k_norm weight
sharding, fp8 load dequant, linear attn NPU RMSNorm, config patches.
Nagisa125 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Mar 17, 2026
### What this PR does / why we need it?

Initial version to support minimax-m2.5 on vllm-ascend.
This commit converts the original FP8 weights to dequantized BF16 to
support Minimax-m2.5 on NPU.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
vllm-project/vllm@4034c3d

### Test Report
Self-tested precision summary; the official AIME2025 precision score
is 86.3.
<img width="426" height="84" alt="image"
src="https://github.com/user-attachments/assets/a3ce2452-92fa-4713-962e-862248e0b61a"
/>

---------

Signed-off-by: limuyuan <limuyuan3@huawei.com>
Signed-off-by: SparrowMu <52023119+SparrowMu@users.noreply.github.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>
liuchenbing2026 pushed a commit to liuchen20/vllm-ascend that referenced this pull request Mar 24, 2026
Cherry-pick from upstream PR vllm-project#7105: MoE all_reduce, k_norm weight
sharding, fp8 load dequant, linear attn NPU RMSNorm, config patches.