Optimize minimax m2 modelling forward pass #2176
Conversation
@avtc I will take a look. It does a lot of in-place tensor mutations, which reduces the allocation of temp tensors; that is great! Make sure that, with or without the patch, the "error_loss" from gptq or awq quantization is exactly the same for the first 2 layers. This is just a quick way to verify the changes.
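For context, this is roughly what in-place mutation means here (a minimal generic sketch for illustration, not the actual code from this PR; shapes and names are assumptions):

```python
import torch

def combine_expert_outputs_alloc(expert_out, weights):
    # expert_out: [tokens, experts, hidden], weights: [tokens, experts]
    # Allocates a fresh temp tensor for the multiply and another for the sum.
    return (expert_out * weights.unsqueeze(-1)).sum(dim=1)

def combine_expert_outputs_inplace(expert_out, weights, out):
    # Mutates expert_out in place and writes the reduction into a
    # preallocated `out` buffer, avoiding the intermediates above.
    expert_out.mul_(weights.unsqueeze(-1))
    torch.sum(expert_out, dim=1, out=out)
    return out
```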
@avtc huggingface/transformers#42028 The official MiniMax M2 PR has been created, so we should use the official code instead.
@avtc Have you tried with the official code? Does it resolve your memory usage issue?
@Qubitium No, I thought the PR [https://github.com/huggingface/transformers/pull/42028] needs to be merged first (it is still open) and a new Transformers version needs to be released, so I am waiting for that to happen. And even after that, I am not sure it will handle your BF16 checkpoint properly.
Wth, it's been 2 weeks and they still have not merged this PR? Transformers is becoming a running meme now when it comes to Chinese model support.
@Qubitium
With these optimizations, made with gemini-pro/glm-4.6-q3/gpt, I was able to proceed with quantizing Minimax-m2 to int4g32 (at least 5 layers done and still proceeding), using 1024 c4/en + 512 gsm/arc/humaneval/alpaca samples, ~496K tokens in total, on 8x3090 GPUs. This was not on the latest GPTQModel main, but on a fork prior to data-parallel plus a few cherry-picks (branch: https://github.com/avtc/GPTQModel/tree/feature/v4-minimax-m2-chery). Without these optimizations, the maximum number of samples that could forward pass the attention module was 32 on the same branch.
(The latest main branch consumes more VRAM, so after the forward pass there is no room left for quantization: there are warnings that the Hessian inverse will run on CPU, followed by CUDA OOM. I will check a little later whether excluding device cuda:0 from forward/quantization helps.)
The modelling .py file should be placed into the ModelCloud/MiniMax-M2-BF16 model folder prior to quantization.
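For reference, the quantization run roughly follows this shape (a sketch only, assuming the current GPTQModel API; the exact arguments, the `trust_remote_code` flag, and the calibration list below are placeholders, not the exact command I ran):

```python
from gptqmodel import GPTQModel, QuantizeConfig

model_dir = "ModelCloud/MiniMax-M2-BF16"   # local folder containing the patched modelling .py
quant_config = QuantizeConfig(bits=4, group_size=32)  # int4 g32 as described above

# Stand-in for the real 1024 c4/en + 512 gsm/arc/humaneval/alpaca sample set.
calibration = ["example calibration text ..."]

# trust_remote_code is assumed to be needed/forwarded for the custom modelling file.
model = GPTQModel.load(model_dir, quant_config, trust_remote_code=True)
model.quantize(calibration)
model.save("MiniMax-M2-int4g32")
```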
I have compared the weight loss for the quantized experts, and it is identical or very close to the original modelling .py when the number of tokens routed to an expert is the same. Small deviations of 1-2 tokens between experts do happen; I don't know if that is caused by the optimizations or is expected. To compare, I used 32 samples for both the original and the optimized version and checked the auto-generated logs.
For example, original:

```json
{ "process": "gptq", "layer": 0, "module": "block_sparse_moe.experts.4.w1", "loss": "0.0008356524", "samples": "1053", "damp": "0.10000", "time": "2.121", "fwd_time": "6.724", "(v)ram": "8400.74MB, 2784.61MB", "dynamic": null }
{ "process": "gptq", "layer": 0, "module": "block_sparse_moe.experts.2.w1", "loss": "0.3386714458", "samples": "3", "damp": "0.10000", "time": "2.192", "fwd_time": "6.724", "(v)ram": "8400.74MB, 2793.61MB", "dynamic": null }
```

optimized:

```json
{ "process": "gptq", "layer": 0, "module": "block_sparse_moe.experts.4.w1", "loss": "0.0008337825", "samples": "1055", "damp": "0.10000", "time": "2.048", "fwd_time": "6.663", "(v)ram": "8389.34MB, 2810.38MB", "dynamic": null }
{ "process": "gptq", "layer": 0, "module": "block_sparse_moe.experts.2.w1", "loss": "0.3386579355", "samples": "3", "damp": "0.10000", "time": "2.140", "fwd_time": "6.663", "(v)ram": "8398.34MB, 2947.71MB", "dynamic": null }
```

As I am not a Python LLM/torch/CUDA dev, I cannot validate all of these changes to be correct, but one of the optimizations led to 10 times higher losses (I reverted it), so I think comparing weight losses is a valid sanity check.
Please review and feel free to use/adjust/complete.