
Add support for MiniMax-M2 #42028

Merged
vasqu merged 46 commits into huggingface:main from rogeryoungh:minimax-m2 on Jan 9, 2026

Conversation

@rogeryoungh
Contributor

What does this PR do?

This PR adds the MiniMax-M2 model from MiniMaxAI to Hugging Face Transformers.

Relevant Links:

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker @Cyrilvallez

xuebi added 7 commits October 31, 2025 14:17
Signed-off-by: xuebi <xuebi@minimaxi.com>
Signed-off-by: xuebi <xuebi@minimaxi.com>
Signed-off-by: xuebi <xuebi@minimaxi.com>
Signed-off-by: xuebi <xuebi@minimaxi.com>
Signed-off-by: xuebi <xuebi@minimaxi.com>
Signed-off-by: xuebi <xuebi@minimaxi.com>
Signed-off-by: xuebi <xuebi@minimaxi.com>
Contributor

@molbap molbap left a comment

Very clean integration, no particular comments. Thank you! cc @Cyrilvallez for core review

rogeryoungh and others added 2 commits November 6, 2025 12:42
Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>
Signed-off-by: xuebi <xuebi@minimaxi.com>
@molbap
Contributor

molbap commented Nov 13, 2025

Failing tests on the hub are unrelated timeouts. Thanks for the update!

@molbap molbap requested a review from ArthurZucker November 13, 2025 14:37
@Qubitium
Contributor

Qubitium commented Nov 19, 2025

@ArthurZucker let's get this merged quickly. Two weeks is an eternity in the current model-wars cycle.

Contributor

@vasqu vasqu left a comment

Added some comments, but this already looks very nice, good job! Some things have changed on main, so you will likely need to add our new weights converter to this model as well.

Imo, the integration tests look suspicious and may not have been run?

Comment on lines +232 to +251
class MiniMaxM2Experts(MixtralExperts):
    pass


class MiniMaxM2SparseMoeBlock(MixtralSparseMoeBlock):
    def __init__(self, config):
        nn.Module.__init__(self)
        self.top_k = config.num_experts_per_tok
        self.jitter_noise = config.router_jitter_noise
        self.gate = nn.Linear(config.hidden_size, config.num_local_experts, bias=False)
        self.experts = MiniMaxM2Experts(config)
        self.register_buffer("e_score_correction_bias", torch.zeros(config.num_local_experts))

    def route_tokens_to_experts(self, router_logits):
        routing_weights = torch.nn.functional.sigmoid(router_logits.float())
        scores_for_choice = routing_weights + self.e_score_correction_bias
        _, top_k_index = torch.topk(scores_for_choice, self.top_k, dim=-1, sorted=False)
        top_k_weights = routing_weights.gather(1, top_k_index)
        top_k_weights /= top_k_weights.sum(dim=-1, keepdim=True)
        return top_k_index, top_k_weights.to(router_logits.dtype)
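The routing logic above can be exercised standalone. A minimal sketch (values and shapes hypothetical, not the exact class): note that the correction bias only influences which experts are selected, while the returned weights come from the unbiased sigmoid scores.

```python
import torch

def route_tokens_to_experts(router_logits, e_score_correction_bias, top_k):
    # Sigmoid scores per expert, [num_tokens, num_experts]
    routing_weights = torch.sigmoid(router_logits.float())
    # Bias shifts expert *selection* only...
    scores_for_choice = routing_weights + e_score_correction_bias
    _, top_k_index = torch.topk(scores_for_choice, top_k, dim=-1, sorted=False)
    # ...the actual mixing weights are gathered from the unbiased scores.
    top_k_weights = routing_weights.gather(1, top_k_index)
    top_k_weights /= top_k_weights.sum(dim=-1, keepdim=True)
    return top_k_index, top_k_weights.to(router_logits.dtype)

logits = torch.tensor([[2.0, -1.0, 0.5, 1.5]])  # one token, four experts
idx, w = route_tokens_to_experts(logits, torch.zeros(4), top_k=2)
# With zero bias, the two highest-sigmoid experts (0 and 3) are picked,
# and the gathered weights are renormalized to sum to 1.
```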
Contributor

We recently changed how we build MoE, can you add this to our weight converter and adjust accordingly; see

mapping["phimoe"] = mapping["mixtral"].copy()

Contributor

This possibly also affects the tp plan

Contributor

@Qubitium Qubitium Nov 22, 2025

@vasqu @rogeryoungh I am not minimax staff but still have concerns myself.

Since HF Transformers changes broke this PR's compatibility, what exactly is _build_checkpoint_conversion_mapping() doing? What is WeightConverter?

I see no method-level documentation, only internal code comments that look like list generation to assist in concatenating/fusing MoE modules. How everything is glued together is still unclear.

MiniMax does not use the HF Transformers stack internally, and I think they would have the same questions I do. What exactly are the latest v5 changes regarding MoE? Is there an overview doc to make this clear?

@ArthurZucker Can we get a global (even in alpha form) doc/readme on the MoE arch changes in v5 so model makers have a clear bird's-eye view of the v5 target when it comes to MoE?

Contributor

@vasqu vasqu Nov 24, 2025

Tried to answer here a bit #42028 (review)

  • _build_checkpoint_conversion_mapping is providing a mapping for model_type -> weight renaming+conversion
  • These operations can be split into 2 types
    • Renaming, should be self-explanatory --> rename weights
    • WeightConverter --> change the original weights in some way (spread out in core_model_loading
      • Chunk --> split the weights along a dimension into x chunks, e.g. useful for qkv splitting into q, k, and v
      • Concatenate --> essentially the reverse of chunk, e.g. fusing q, k, v into qkv
      • MergeModulelist --> merge a module list of tensors into one big tensor along a new dimension, e.g. when a module list should be one big parameter instead of a list (useful for us to merge MoE parameters and add hf kernels on top for more efficient inference etc)
      • PermuteForRope --> more complicated but essentially related to how weights are rotated, e.g. original llama needs this for hf RoPE conversion

You can stack these operations as well but we continue to change/add these.

The standard for MoE currently is to have fused experts into one big parameter instead of a modulelist. We use these to enable kernels, e.g.

@use_kernel_forward_from_hub("Llama4TextMoe")

This is for efficiency reasons in essence.
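The converter concepts described above map onto plain tensor operations. A minimal sketch of the underlying ideas (illustrative only; the real WeightConverter classes live in transformers' core_model_loading and differ in interface):

```python
import torch

# Chunk: split a fused projection back into pieces,
# e.g. a stacked qkv weight into q, k, v along dim 0.
qkv = torch.randn(3 * 8, 16)
q, k, v = torch.chunk(qkv, 3, dim=0)

# Concatenate: the reverse operation, fusing q, k, v into one qkv weight.
qkv_again = torch.cat([q, k, v], dim=0)

# MergeModulelist: stack a module list of per-expert weights into
# one big parameter along a new leading dimension, so MoE experts
# become a single tensor that fused kernels can consume.
expert_weights = [torch.randn(16, 32) for _ in range(4)]
merged = torch.stack(expert_weights, dim=0)  # [num_experts, 16, 32]
```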

@vasqu
Contributor

vasqu commented Nov 19, 2025

@Qubitium sorry for the delays, will be taking over for the most part. Crunching some important features for v5 so some PRs got lost under it, apologies

@rogeryoungh
Contributor Author

I've encountered a strange problem:

>>> model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "../transformers/src/transformers/models/auto/auto_factory.py", line 373, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "../transformers/src/transformers/modeling_utils.py", line 278, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "../transformers/src/transformers/modeling_utils.py", line 3967, in from_pretrained
    device_map = _get_device_map(model, device_map, max_memory, hf_quantizer)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "../transformers/src/transformers/integrations/accelerate.py", line 407, in _get_device_map
    device_map = infer_auto_device_map(
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "../transformers/src/transformers/integrations/accelerate.py", line 736, in infer_auto_device_map
    current_buffer_size = compute_module_total_buffer_size(module, hf_quantizer)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "../transformers/src/transformers/integrations/accelerate.py", line 260, in compute_module_total_buffer_size
    module_sizes, _ = compute_module_sizes(model, hf_quantizer, buffers_only=True)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "../transformers/src/transformers/integrations/accelerate.py", line 240, in compute_module_sizes
    dtype_size = hf_quantizer.param_element_size(model, name)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "../transformers/src/transformers/quantizers/base.py", line 182, in param_element_size
    return model.get_parameter_or_buffer(param_name).element_size()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "../transformers/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1964, in __getattr__
    raise AttributeError(
AttributeError: 'MiniMaxM2DecoderLayer' object has no attribute 'get_parameter_or_buffer'

I found a solution #42342, which involves adding a get_parameter_or_buffer function to the MiniMaxM2RotaryEmbedding. However, for MiniMax-M2, e_score_correction_bias also needs to be changed to nn.Parameter.

@@ -106,7 +106,10 @@ class MiniMaxM2SparseMoeBlock(nn.Module):
         self.jitter_noise = config.router_jitter_noise
         self.gate = nn.Linear(config.hidden_size, config.num_local_experts, bias=False)
         self.experts = MiniMaxM2Experts(config)
-        self.register_buffer("e_score_correction_bias", torch.zeros(config.num_local_experts))
+        self.e_score_correction_bias = nn.Parameter(
+            torch.zeros(config.num_local_experts),
+            requires_grad=False,
+        )
 
     def forward(self, hidden_states: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
         batch_size, sequence_length, hidden_dim = hidden_states.shape
@@ -168,6 +171,20 @@ class MiniMaxM2RotaryEmbedding(nn.Module):
         self.register_buffer("inv_freq", inv_freq, persistent=False)
         self.original_inv_freq = inv_freq
 
+    # Add a compatibility method so callers expecting PreTrainedModel-like API don't crash.
+    def get_parameter_or_buffer(self, name: str):
+        # Prefer direct attribute access (parameters and buffers are attributes)
+        if hasattr(self, name):
+            return getattr(self, name)
+        # Fallback: search named parameters and buffers (non-recursive to keep semantics)
+        for n, p in self.named_parameters(recurse=False):
+            if n == name:
+                return p
+        for n, b in self.named_buffers(recurse=False):
+            if n == name:
+                return b
+        raise AttributeError(f"{self.__class__.__name__} has no parameter or buffer named '{name}'")
+
     @staticmethod
     def compute_default_rope_parameters(
         config: Optional[MiniMaxM2Config] = None,

@MekkCyber
Contributor

MekkCyber commented Nov 24, 2025

Hi @rogeryoungh, we are taking care of this in : https://github.com/huggingface/transformers/pull/42289/files#diff-7f40070336f6d7b1ffe08e654cdf930080e3cbd4dbbcbee2996fabe4ffc1c2b3, this should fix the issue without adding a get_parameter_or_buffer

Contributor

@vasqu vasqu left a comment

Added some comments, mostly revolving around the weight names/conventions. See #41580 for the PR that introduced this.

The essence is that we now support on-the-fly renaming and weight conversion (similar to vLLM). You only have to specify your operations in

def _build_checkpoint_conversion_mapping():

In your case, this will look very similar to Mixtral I assume, i.e. block_sparse->mlp + fusing the MoE module lists to parameters. Maybe additional mlp renaming for the base MLP to inherit from LlamaMLP. Please let us know if it doesn't work as expected; it's still WIP in a way and we're working on providing better docs for this (cc @ArthurZucker).

I know this is a lot at once so apologies.



@require_torch
class MiniMaxM2IntegrationTest(unittest.TestCase):
Contributor

Can you provide us a slimmed-down model version, ideally one that runs on A10 GPUs (24GB), so that we can have some integration tests? I can move/copy this model to our internal repos.

Contributor

^

Contributor Author

I wrote a script to generate small models, but I encountered a problem: AutoModelForCausalLM.from_config does not follow the quantization_config settings, so the generated models do not have FP8Linear layers.
https://github.com/rogeryoungh/MiniMaxM2TinyModelGenerator/blob/main/generator.py

Contributor

@vasqu vasqu Dec 5, 2025

This would initialize random weights, no? I think we can just cut the original safetensors and load only a partial subset of the original instead (e.g. 8 layers), wdyt? This should keep the fp8 weights

Otherwise, even if it is saved in bf16 for example, we should be able to quantize to fp8 on the fly. Not ideal but the goal here is to have a working version.
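The layer-truncation idea above amounts to filtering the checkpoint's keys by layer index. A hedged sketch, assuming the usual "model.layers.<i>." key convention (the helper name and key pattern are hypothetical, not from this PR):

```python
import re

def keep_first_layers(state_dict_keys, num_layers=8):
    """Keep non-layer weights plus layers whose index is below the cutoff."""
    kept = []
    for key in state_dict_keys:
        m = re.match(r"model\.layers\.(\d+)\.", key)
        if m is None or int(m.group(1)) < num_layers:
            kept.append(key)
    return kept

keys = [
    "model.embed_tokens.weight",
    "model.layers.0.mlp.gate.weight",
    "model.layers.9.mlp.gate.weight",  # dropped: index >= 8
]
```

Applied to safetensors shards, this would preserve the original (fp8) dtypes since the kept tensors are copied verbatim rather than re-initialized.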

Contributor

Native fp8 weights for models initialized from config cc @MekkCyber @SunMarc

Collaborator

Something we should add support for IMO. @vasqu can you open an issue for this please!

Contributor

Opened #42804 to keep track

@github-actions
Contributor

github-actions bot commented Jan 7, 2026

This comment contains run-slow, running the specified jobs:

models: ["models/minimax_m2"]
quantizations: []

@github-actions
Contributor

github-actions bot commented Jan 7, 2026

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉 !

@vasqu vasqu mentioned this pull request Jan 7, 2026
@vasqu
Contributor

vasqu commented Jan 7, 2026

Sorry for the delays, just coming back from the holidays @rogeryoungh. I've synced with the latest changes

Will come back to this tomorrow but wanted to give a first insight, should be close to getting merged 🤗

@rogeryoungh
Contributor Author

I found that the current code does not run correctly and throws an error at self.experts(hidden_states, top_k_index, top_k_weights). But I'm not sure whether it's a problem with MiniMaxM2TopKRouter or FP8Expert.

I found a way to fix it:

@@ -59,7 +59,7 @@ class MiniMaxM2TopKRouter(nn.Module):
         _, top_k_index = torch.topk(scores_for_choice, self.top_k, dim=-1, sorted=False)
         top_k_weights = routing_weights.gather(1, top_k_index)
         top_k_weights /= top_k_weights.sum(dim=-1, keepdim=True)
-        router_scores = top_k_weights
+        router_scores = torch.zeros_like(routing_weights).scatter_(1, top_k_index, top_k_weights)
         return router_logits, router_scores, top_k_index
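The scatter_ line in the diff above turns the sparse top-k weights into a dense per-expert score matrix. A small standalone illustration (values hypothetical):

```python
import torch

# One token, three experts: only the top-2 experts get nonzero scores.
routing_weights = torch.tensor([[0.1, 0.6, 0.3]])
top_k_weights, top_k_index = torch.topk(routing_weights, k=2, dim=-1)

# Dense [num_tokens, num_experts] matrix: selected experts keep their
# weights, non-selected experts stay at zero.
router_scores = torch.zeros_like(routing_weights).scatter_(1, top_k_index, top_k_weights)
```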

@vasqu
Contributor

vasqu commented Jan 8, 2026

@rogeryoungh it will be fixed by #43154. Yes, your solution would work as well, but we removed that way of routing because scatter caused issues iirc (see #42456).

Will finalize that PR and update this branch then

@vasqu vasqu closed this in #43154 Jan 8, 2026
@vasqu vasqu reopened this Jan 8, 2026
@github-actions
Contributor

github-actions bot commented Jan 8, 2026

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42028&sha=cca9ee

@github-actions
Contributor

github-actions bot commented Jan 8, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, minimax_m2


@github-actions
Contributor

github-actions bot commented Jan 8, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, minimax_m2, finegrained_fp8

@vasqu
Contributor

vasqu commented Jan 8, 2026

run-slow: minimax_m2

@github-actions
Contributor

github-actions bot commented Jan 8, 2026

This comment contains run-slow, running the specified jobs:

models: ["models/minimax_m2"]
quantizations: []

@github-actions
Contributor

github-actions bot commented Jan 8, 2026

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42028&sha=b86771

@github-actions
Contributor

github-actions bot commented Jan 8, 2026

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉 !

@github-actions
Contributor

github-actions bot commented Jan 8, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, minimax_m2, finegrained_fp8

@vasqu
Contributor

vasqu commented Jan 8, 2026

Ok, now everything works and is synced with the latest changes on main. Hub PRs are open at

Lmk if you have any last changes @rogeryoungh, otherwise I'll merge tomorrow as it works from our side then.

Collaborator

@ArthurZucker ArthurZucker left a comment

Very clean thanks!

@vasqu vasqu enabled auto-merge (squash) January 9, 2026 15:16
@vasqu vasqu merged commit 7a2bf25 into huggingface:main Jan 9, 2026
25 checks passed
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

mratsim added a commit to mratsim/llm-compressor that referenced this pull request Jan 19, 2026
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
* update: init m2

Signed-off-by: xuebi <xuebi@minimaxi.com>

* update: docs and config

Signed-off-by: xuebi <xuebi@minimaxi.com>

* update: init minimax-m2 test

Signed-off-by: xuebi <xuebi@minimaxi.com>

* update: fix tests

Signed-off-by: xuebi <xuebi@minimaxi.com>

* update: use partial_rotary_factor

Signed-off-by: xuebi <xuebi@minimaxi.com>

* update: some fix

Signed-off-by: xuebi <xuebi@minimaxi.com>

* fix: import Unpack from processing_utils

Signed-off-by: xuebi <xuebi@minimaxi.com>

* update: apply suggestions from code review

Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>

* update: remove MiniMaxM2DecoderLayer and MiniMaxM2MLP

Signed-off-by: xuebi <xuebi@minimaxi.com>

* update: remove use_qk_norm

* update: remove unused use_qk_norm

* update: update config and attention

* update: add to tokenization_auto and remove unused test

* update: fix decoder layer and experts

* update: fix docs

* update: make ci happy

* refactor: use mapping

* update: remove unused comments

* update: fix rope_params and router

* update: remove rope_theta

* update: test_load_balancing_loss

* update: docs

* update: fix default theta

* update to proper default values, proper config rope, simplified modular

* fix docs

* modular fixup

* review comments

* update slow tests

* style

* fp32 strict

* revert the flag

* sync with latest changes

* fixup buffer init

* add cache exception to minimax m2 as we have a naming clash

* fix dtype issue in gate

* lift fp8 test restriction and apply new linter rules

* update docs

---------

Signed-off-by: xuebi <xuebi@minimaxi.com>
Co-authored-by: xuebi <xuebi@minimaxi.com>
Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>
Co-authored-by: vasqu <antonprogamer@gmail.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
@vasqu vasqu added the New model label Feb 3, 2026
mratsim added a commit to mratsim/llm-compressor that referenced this pull request Feb 13, 2026

7 participants