[Model] Support Minimax-m2.5 on NPU #7105
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
- If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request provides the foundational support for running the Minimax-m2.5 model on NPU devices within the vllm-ascend ecosystem. It addresses compatibility challenges by implementing a mechanism to dequantize FP8 weights to BF16, ensuring proper model loading and execution. Additionally, it includes performance enhancements and configuration adjustments tailored for NPU, such as optimizing attention layers and managing environment variables for efficient graph capture.
Code Review
This pull request introduces support for the Minimax-m2.5 model on NPU by adding several patches. These patches handle fp8 dequantization, adjust for tensor parallelism in MoE and attention layers, and include hardware-specific optimizations. My review has identified a potential logical conflict in the quantization verification patch and a memory-inefficient operation during weight dequantization. Addressing these concerns will enhance the patch's robustness and performance.
# Some versions may read self.quantization before calling platform verifier.
_disable_fp8(self, log=True)
There is a logical conflict in _patched_verify_quantization that makes the patch's behavior unclear and potentially incorrect. The call to _disable_fp8 on line 103 nullifies self.quantization before _original_verify_quantization is called; the comment on line 102 justifies this as necessary for some vLLM versions.
However, this directly conflicts with the _platform_verify_hook mechanism. The hook will receive None as quant_method, causing the check _should_disable_fp8(self, quant_method) to always fail. This makes the core interception logic within the hook (lines 95-98) unreachable.
This implementation cannot simultaneously handle versions that require pre-emptive nullification of self.quantization and versions that rely on the hook to intercept the quantization method. Please clarify the intended logic and resolve this conflict. One approach might be to remove line 103 if the hook-based interception is the primary mechanism desired.
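To make the suggested resolution concrete, here is a minimal, self-contained sketch (not the patch's actual code) of the hook-only approach: the pre-emptive nullification is dropped so the hook sees the real quantization method. The `ModelConfig` stand-in and helper bodies below are simplified assumptions that only mirror the names used in the patch.

```python
class ModelConfig:
    """Simplified stand-in for vLLM's ModelConfig (assumption, not real code)."""

    def __init__(self, quantization):
        self.quantization = quantization

    def _verify_quantization(self):
        print(f"verifying quantization={self.quantization!r}")


def _should_disable_fp8(config, quant_method):
    # NPU path: native fp8 kernels are unavailable, so fp8 must be intercepted.
    return quant_method == "fp8"


def _disable_fp8(config, log=False):
    if log:
        print("fp8 not supported on NPU; weights will be dequantized to bf16")
    config.quantization = None


_original_verify_quantization = ModelConfig._verify_quantization


def _platform_verify_hook(config, quant_method):
    # Core interception: fp8 is disabled only here, so quant_method still
    # carries the real value instead of None.
    if _should_disable_fp8(config, quant_method):
        _disable_fp8(config, log=True)


def _patched_verify_quantization(self):
    # No pre-emptive _disable_fp8 call: the hook sees the true method first,
    # then the original verifier runs on the (possibly cleared) config.
    _platform_verify_hook(self, self.quantization)
    _original_verify_quantization(self)


ModelConfig._verify_quantization = _patched_verify_quantization

cfg = ModelConfig(quantization="fp8")
cfg._verify_quantization()      # hook clears fp8, then verification proceeds
assert cfg.quantization is None
```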
expanded_scale = weight_scale_inv.repeat_interleave(
    block_n, dim=0
).repeat_interleave(block_k, dim=1)
expanded_scale = expanded_scale[:n, :k].to(dtype=torch.bfloat16)
The use of repeat_interleave to create expanded_scale can be very memory-intensive, as it creates a temporary tensor of the same size as the full weight tensor. For large models, this could potentially lead to out-of-memory errors. Consider a more memory-efficient approach, such as iterating over blocks and applying the scaling factor, to avoid allocating this large intermediate tensor.
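As a concrete illustration of the block-iterating alternative, here is a minimal sketch; it assumes the same layout as the snippet above (one scale per `(block_n, block_k)` tile in `weight_scale_inv`) and is not the patch's actual implementation.

```python
import torch


def dequant_fp8_blockwise(weight_fp8, weight_scale_inv, block_n, block_k):
    """Dequantize block-quantized weights one tile at a time (illustrative sketch).

    Instead of materializing a full (n, k) expanded_scale via repeat_interleave,
    each (block_n, block_k) tile is scaled in place, so the only large
    allocation is the bf16 output itself.
    """
    n, k = weight_fp8.shape
    out = weight_fp8.to(torch.bfloat16)          # assumed cast from fp8 storage
    n_blocks, k_blocks = weight_scale_inv.shape
    for i in range(n_blocks):
        rs, re = i * block_n, min((i + 1) * block_n, n)
        for j in range(k_blocks):
            cs, ce = j * block_k, min((j + 1) * block_k, k)
            # The temporary here covers a single tile, not the whole weight.
            out[rs:re, cs:ce] *= weight_scale_inv[i, j].to(torch.bfloat16)
    return out


# Toy shapes for illustration (assumed values, not from the PR):
w = torch.randn(6, 8)
scales = torch.rand(2, 2)                        # one scale per (3, 4) tile
deq = dequant_fp8_blockwise(w, scales, block_n=3, block_k=4)
print(deq.shape, deq.dtype)                      # torch.Size([6, 8]) torch.bfloat16
```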
…converting original fp8 weight to dequantized bf16. Signed-off-by: limuyuan <limuyuan3@huawei.com>
Signed-off-by: limuyuan <limuyuan3@huawei.com>
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Signed-off-by: SparrowMu <52023119+SparrowMu@users.noreply.github.com>
…to qwen3next_graph

* 'main' of https://github.com/vllm-project/vllm-ascend: (88 commits)
  - [main][bugfix] Fixed the problem of speculative decoding in FULL mode (vllm-project#7148)
  - fixed fia pad logic in graph mode. (vllm-project#7144)
  - [Doc] fix DSV3.1 PD configs (vllm-project#7187)
  - refactor: add a check before layer_sharding logging (vllm-project#7186)
  - [Build] Add support for Ascend950 chip (vllm-project#7151)
  - Revert "[CI] fix skiped e2e test when upgrade vllm version (vllm-project#6654)" (vllm-project#7166)
  - [MODELRUNNERV2]fix penality ops (vllm-project#7013)
  - [Bugfix][LoRA] Fix the issue when enable LoRA + tp + fully_sharded_loras (vllm-project#6650)
  - [KV Pool]get_num_new_matched_tokens return 0 if token length < block_size (vllm-project#7146)
  - [CI] Build Image for v0.16.0rc1 (vllm-project#7155)
  - [CI] Skip `test_mooncake_layerwise_connector.py` in `ut` (vllm-project#7147)
  - [BugFix]Fix recomputed scheduler bug (vllm-project#7137)
  - [Model] Support Minimax-m2.5 on NPU (vllm-project#7105)
  - [P/D]Mooncake Layerwise Connector supports hybrid attention manager with multiple kvcache groups (vllm-project#7022)
  - Add patch_qwen3_5 for triton ops fused_recurrent_gated_delta_rule (vllm-project#7109)
  - [Doc][ReleaseNote] Add release notes for v0.16.0rc1 (vllm-project#7067)
  - [Misc] Download on both hk and guiyang region (vllm-project#7129)
  - [bugdix] The problem that the w4a8 weight fails to be loaded when the EP is not enabled is resolved. (vllm-project#7090)
  - [eagle][cp] fix eagle_cp enable bug2 (vllm-project#7079)
  - [CI]Upgrade niglty multi-node-tests max-parallel to 2 (vllm-project#7035)
  - ...
Cherry-pick from upstream PR vllm-project#7105: MoE all_reduce, k_norm weight sharding, fp8 load dequant, linear attn NPU RMSNorm, config patches.
### What this PR does / why we need it?
Initial version to support minimax-m2.5 on vllm-ascend. This commit converts the original fp8 weights to dequantized bf16 to support Minimax-m2.5 on NPU.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main: vllm-project/vllm@4034c3d

### Test Report
Self-tested precision summary; the official precision score for AIME2025 is 86.3.
<img width="426" height="84" alt="image" src="https://github.com/user-attachments/assets/a3ce2452-92fa-4713-962e-862248e0b61a" />

---------

Signed-off-by: limuyuan <limuyuan3@huawei.com>
Signed-off-by: SparrowMu <52023119+SparrowMu@users.noreply.github.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>
What this PR does / why we need it?
Initial version to support minimax-m2.5 on vllm-ascend.
This commit converts the original fp8 weights to dequantized bf16 to support Minimax-m2.5 on NPU.
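For context, here is a hypothetical usage sketch of loading such a checkpoint through vLLM's offline Python API once the fp8-to-bf16 load path is in place; the model path and parallelism settings are placeholders, not values taken from this PR.

```python
from vllm import LLM, SamplingParams

# Hypothetical example; the checkpoint path and tensor_parallel_size below are
# placeholders, not settings verified in this PR.
llm = LLM(
    model="/path/to/minimax-m2.5",   # placeholder local checkpoint directory
    dtype="bfloat16",                # fp8 weights are dequantized to bf16 at load time
    tensor_parallel_size=8,          # assumed multi-NPU setup
    trust_remote_code=True,
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```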
Does this PR introduce any user-facing change?
How was this patch tested?
Test Report
Self-tested precision summary; the official precision score for AIME2025 is 86.3.
