Optimize minimax m2 modelling forward pass #2176
Conversation
@avtc I will take a look. It does a lot of in-place tensor mutations, which reduces the allocation of temp tensors; that is great! Make sure that, with or without the patch, the "error_loss" from gptq or awq quantization is exactly the same for the first 2 layers. This is just a quick way to verify the changes.
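For context, this is roughly what in-place mutation means here (a minimal generic sketch for illustration, not the actual code from this PR; shapes and names are assumptions):

```python
import torch

def combine_expert_outputs_alloc(expert_out, weights):
    # expert_out: [tokens, experts, hidden], weights: [tokens, experts]
    # Allocates a fresh temp tensor for the multiply and another for the sum.
    return (expert_out * weights.unsqueeze(-1)).sum(dim=1)

def combine_expert_outputs_inplace(expert_out, weights, out):
    # Mutates expert_out in place and writes the reduction into a
    # preallocated `out` buffer, avoiding the intermediates above.
    expert_out.mul_(weights.unsqueeze(-1))
    torch.sum(expert_out, dim=1, out=out)
    return out
```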
@avtc huggingface/transformers#42028 The official MiniMax M2 PR has been created, so we should use the official code instead.
@avtc Have you tried with the official code? Does it resolve your memory usage issue?
@Qubitium No, I thought the PR [https://github.com/huggingface/transformers/pull/42028] needs to be merged first (it is still open) and a new Transformers version needs to be released, so I am waiting for that to happen. And even after that, I am not sure it will handle your BF16 checkpoint properly.
Wth, it's been 2 weeks and they still have not merged this PR? Transformers is becoming a running meme now when it comes to Chinese model support.
@Qubitium
With these optimizations, made with gemini-pro/glm-4.6-q3/gpt, I was able to proceed with quantizing Minimax-m2 to int4g32 (at least 5 layers done and still proceeding), using 1024 c4/en + 512 gsm/arc/humaneval/alpaca samples, ~496K tokens in total, on 8x3090 GPUs. This was not on the latest GPTQModel main, but on a fork prior to data-parallel plus a few cherry-picks (branch: https://github.com/avtc/GPTQModel/tree/feature/v4-minimax-m2-chery). Without these optimizations, the maximum number of samples that could forward pass the attention module was 32 on the same branch.
(The latest main branch consumes more VRAM, so after the forward pass there is no room left for quantization: there are warnings that the Hessian inverse will run on CPU, followed by CUDA OOM. I will check a little later whether excluding device cuda:0 from forward/quantization helps.)
The modelling .py file should be placed into the ModelCloud/MiniMax-M2-BF16 model folder prior to quantization.
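For reference, the quantization run roughly follows this shape (a sketch only, assuming the current GPTQModel API; the exact arguments, the `trust_remote_code` flag, and the calibration list below are placeholders, not the exact command I ran):

```python
from gptqmodel import GPTQModel, QuantizeConfig

model_dir = "ModelCloud/MiniMax-M2-BF16"   # local folder containing the patched modelling .py
quant_config = QuantizeConfig(bits=4, group_size=32)  # int4 g32 as described above

# Stand-in for the real 1024 c4/en + 512 gsm/arc/humaneval/alpaca sample set.
calibration = ["example calibration text ..."]

# trust_remote_code is assumed to be needed/forwarded for the custom modelling file.
model = GPTQModel.load(model_dir, quant_config, trust_remote_code=True)
model.quantize(calibration)
model.save("MiniMax-M2-int4g32")
```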
I have compared the weight loss for the quantized experts, and it is identical or very close to the original modelling .py when the number of tokens routed to an expert is the same. Small deviations of 1-2 tokens between experts do happen; I don't know if that is caused by the optimizations or is expected. To compare, I used 32 samples for both the original and the optimized version and checked the auto-generated logs.
For example, original:

```json
{ "process": "gptq", "layer": 0, "module": "block_sparse_moe.experts.4.w1", "loss": "0.0008356524", "samples": "1053", "damp": "0.10000", "time": "2.121", "fwd_time": "6.724", "(v)ram": "8400.74MB, 2784.61MB", "dynamic": null }
{ "process": "gptq", "layer": 0, "module": "block_sparse_moe.experts.2.w1", "loss": "0.3386714458", "samples": "3", "damp": "0.10000", "time": "2.192", "fwd_time": "6.724", "(v)ram": "8400.74MB, 2793.61MB", "dynamic": null }
```

optimized:

```json
{ "process": "gptq", "layer": 0, "module": "block_sparse_moe.experts.4.w1", "loss": "0.0008337825", "samples": "1055", "damp": "0.10000", "time": "2.048", "fwd_time": "6.663", "(v)ram": "8389.34MB, 2810.38MB", "dynamic": null }
{ "process": "gptq", "layer": 0, "module": "block_sparse_moe.experts.2.w1", "loss": "0.3386579355", "samples": "3", "damp": "0.10000", "time": "2.140", "fwd_time": "6.663", "(v)ram": "8398.34MB, 2947.71MB", "dynamic": null }
```

As I am not a Python LLM/torch/CUDA dev, I cannot validate all of these changes to be correct, but one of the optimizations led to 10 times higher losses (I reverted it), so I think comparing weight losses is a valid sanity check.
Please review and feel free to use/adjust/complete.