feat: add Mistral Small 4 by NanoCode012 · Pull Request #3502 · axolotl-ai-cloud/axolotl

NanoCode012 · 2026-03-16T19:26:14Z

Description

https://mistral.ai/news/mistral-small-4

The FFT config is untested due to lack of time. The QLoRA config is tested.

Requires their transformers fork, huggingface/transformers#44760 (or main now).

Blocker:

We converted the Mistral weights to BF16 to train on. The HF checkpoint is only available in FP8 at the moment.

Motivation and Context

How has this been tested?

AI Usage Disclaimer

Claude helped run experiments.

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

coderabbitai · 2026-03-16T19:26:37Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c63ee948-14fe-4950-bbc9-5809ac359a33

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR adds comprehensive support for Mistral4 model training, including new configuration examples for FFT and QLoRA approaches with text and vision datasets. It updates model architecture registries, implements MoE routing logic, updates dependencies, and enhances configuration normalization and processor initialization.

Changes

Cohort / File(s)	Summary
Mistral4 Documentation & Configuration Examples `examples/mistral4/README.md`, `examples/mistral4/fft-text.yml`, `examples/mistral4/fft-vision.yml`, `examples/mistral4/qlora-text.yml`, `examples/mistral4/qlora-vision.yml`	Adds comprehensive documentation and four training configuration files for Mistral4 model fine-tuning using FFT (full fine-tuning) and QLoRA (quantized LoRA) approaches with both text and vision datasets, including hyperparameters, LoRA targets, and FSDP configuration.
Dependency & Installation Updates `requirements.txt`, `scripts/cutcrossentropy_install.py`, `examples/colab-notebooks/colab-axolotl-example.ipynb`, `src/axolotl/integrations/cut_cross_entropy/...`	Updates mistral-common from 1.8.8 to 1.10.0 and changes ml-cross-entropy commit hash from e8ad129 to fa9a7fe across Colab notebook, installation script, README documentation, and integration initialization files. Extends supported models list to include mistral4 and nemotron_h.
Model Architecture Registration `src/axolotl/common/architectures.py`, `src/axolotl/integrations/kernels/constants.py`, `src/axolotl/monkeypatch/multipack.py`	Registers "mistral4" model type in MOE_ARCH_BLOCK, SPARSE_MOE_BLOCK mappings, and SUPPORTED_MULTIPACK_MODEL_TYPES list to enable MoE architecture support and multipack optimization.
MoE Routing Implementation `src/axolotl/integrations/kernels/sonicmoe/routing.py`	Introduces softmax_group_topk_routing function implementing group-based top-k expert selection with renormalization and scaling for Mistral4. Extends model configuration mapping to wire mistral4 to new routing strategy with SwiGLU activation.
Configuration & Model Loading `src/axolotl/utils/config/__init__.py`, `src/axolotl/loaders/model.py`, `src/axolotl/loaders/processor.py`, `src/axolotl/integrations/kernels/plugin.py`	Adds VLM text backbone type resolution for model_config_type_text, refactors MOE block lookup to prefer resolved text type, changes processor tokenizer assignment from parameter injection to post-instantiation, and updates MoE kernelization logging to use resolved model type.
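
The "prefer resolved text type" lookup summarized in the last row can be pictured with a small sketch. This is not the PR's code: `resolve_text_model_type`, `lookup_moe_block`, and the `example_vlm` model type are hypothetical names, the configs are toy objects rather than transformers configs, and only the `"mistral4": "Mistral4MoE"` entry comes from the diff.

```python
from types import SimpleNamespace

# Single entry copied from the PR diff; the rest of the real mapping is omitted.
MOE_ARCH_BLOCK = {"mistral4": "Mistral4MoE"}


def resolve_text_model_type(config):
    """Return the text backbone's model_type if the config nests one (hypothetical helper)."""
    text_config = getattr(config, "text_config", None)
    text_type = getattr(text_config, "model_type", None)
    return text_type or config.model_type


def lookup_moe_block(config):
    """Prefer the resolved text type, then fall back to the outer model_type."""
    text_type = resolve_text_model_type(config)
    return MOE_ARCH_BLOCK.get(text_type, MOE_ARCH_BLOCK.get(config.model_type))


# Toy configs: a VLM wrapper around a mistral4 text backbone, and a text-only model.
vlm = SimpleNamespace(
    model_type="example_vlm",  # hypothetical outer type
    text_config=SimpleNamespace(model_type="mistral4"),
)
text_only = SimpleNamespace(model_type="mistral4")

print(lookup_moe_block(vlm))        # Mistral4MoE
print(lookup_moe_block(text_only))  # Mistral4MoE
```

Under these assumptions, a vision-language checkpoint resolves to the same MoE block as its text-only counterpart, which is the behavior the summary describes.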

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Possibly related PRs

chore: bump cut-cross-entropy to 58d6572 #3424: Updates cut-cross-entropy git commit hash and supported model list across multiple integration files.
Add support for batched_mm, grouped_mm and scattermoe for MoE models #3377: Modifies kernels integration and MoE model handling including plugin/kernelization logic for MoE model variants.
Add support for Accelerate CP, ND examples, and fix for parallel config w fsdp #3019: Updates MOE_ARCH_BLOCK and SUPPORTED_MULTIPACK_MODEL_TYPES with new model type registrations.

Suggested labels

under review

Suggested reviewers

winglian
djsaunde
SalmanMohammadi

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 44.44% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat: add Mistral Small 4' directly corresponds to the main objective of adding Mistral 4 (Leanstral) support throughout the codebase, as evidenced by new configurations, routing logic, and architecture mappings for mistral4.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

github-actions · 2026-03-16T19:33:07Z

📖 Documentation Preview: https://69b8c047bc53a126672293e2--resonant-treacle-0fd729.netlify.app

Deployed on Netlify from commit 07c3c5b

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (3)

examples/mistral4/README.md (1)

69-69: Consider using more descriptive link text.

The link text "here" is flagged by static analysis as non-descriptive. More descriptive link text improves accessibility and helps users understand what they'll find before clicking.

📝 Suggested improvement

-- The text dataset format follows the OpenAI Messages format as seen [here](https://docs.axolotl.ai/docs/dataset-formats/conversation.html#chat_template).
+- The text dataset format follows the OpenAI Messages format as documented in the [chat_template dataset format guide](https://docs.axolotl.ai/docs/dataset-formats/conversation.html#chat_template).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@examples/mistral4/README.md` at line 69, Replace the non-descriptive link
text "here" in the README sentence about the text dataset format with a clear,
descriptive label such as "Axolotl OpenAI Messages format documentation" or
"OpenAI Messages format (Axolotl)" so the link reads like: "The text dataset
format follows the OpenAI Messages format as seen in the Axolotl documentation."
Update the anchor text accordingly in examples/mistral4/README.md to improve
accessibility and clarity.

src/axolotl/integrations/kernels/sonicmoe/routing.py (1)

137-160: Add explicit group-routing invariants before view/topk (and clean up unused unpack).

This path currently assumes valid MoE metadata; invalid values will fail deep in tensor ops with hard-to-debug errors. A fast-fail check makes failures deterministic and clearer.

Proposed hardening patch

 def softmax_group_topk_routing(
     hidden_states: torch.Tensor, moe_block
 ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
     """Mistral4-style routing: softmax -> group selection -> topk -> renorm -> scale."""
     gate = moe_block.gate
-    T, H = hidden_states.shape
+    T, _ = hidden_states.shape
     K = moe_block.top_k
     E = getattr(moe_block, "n_routed_experts", gate.weight.shape[0])
     n_group = getattr(moe_block, "n_group", 1)
+    if E % n_group != 0:
+        raise ValueError(f"Invalid routing layout: n_routed_experts={E} not divisible by n_group={n_group}")
+    group_size = E // n_group
+    topk_group = getattr(moe_block, "topk_group", n_group)
+    if topk_group > n_group:
+        raise ValueError(f"Invalid topk_group={topk_group}; must be <= n_group={n_group}")
+    if n_group > 1 and group_size < 2:
+        raise ValueError(f"Invalid group_size={group_size}; group routing requires at least 2 experts per group")

     router_logits = F.linear(hidden_states, gate.weight)  # [T, E]
     router_probs = F.softmax(router_logits, dim=-1, dtype=torch.float32)  # [T, E]

     scores_for_choice = router_probs

     # Group selection: pick top groups, mask the rest
     if n_group > 1:
         group_scores = (
-            scores_for_choice.view(-1, n_group, E // n_group)
+            scores_for_choice.view(-1, n_group, group_size)
             .topk(2, dim=-1)[0]
             .sum(dim=-1)
         )
         group_idx = torch.topk(
-            group_scores, k=moe_block.topk_group, dim=-1, sorted=False
+            group_scores, k=topk_group, dim=-1, sorted=False
         )[1]
         group_mask = torch.zeros_like(group_scores)
         group_mask.scatter_(1, group_idx, 1)
         score_mask = (
-            group_mask.unsqueeze(-1).expand(-1, n_group, E // n_group).reshape(-1, E)
+            group_mask.unsqueeze(-1).expand(-1, n_group, group_size).reshape(-1, E)
         )
         scores_for_choice = scores_for_choice.masked_fill(~score_mask.bool(), 0.0)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/integrations/kernels/sonicmoe/routing.py` around lines 137 - 160,
Add explicit fast-fail checks before the group-routing reshape and topk to
validate MoE metadata and avoid cryptic tensor errors: assert or raise a clear
ValueError if n_group <= 0, if E % n_group != 0, or if getattr(moe_block,
"topk_group", 1) > n_group; also verify hidden_states.dim() == 2 and that T
matches hidden_states.shape[0] if you keep T. Replace the unused unpack "T, H =
hidden_states.shape" with "T, _ = hidden_states.shape" (or use
hidden_states.size(0) for T), and then perform the checks prior to
scores_for_choice.view(...) and torch.topk so failures show meaningful messages
referencing n_group, E, and moe_block.topk_group.
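
The routing order named in the patch's docstring (softmax -> group selection -> topk -> renorm -> scale), together with the proposed fast-fail checks, can be illustrated for a single token in plain Python. This is a minimal sketch and not the PR's implementation: `route_one_token` and its `scale` parameter are hypothetical, and the real function works on batched torch tensors.

```python
import math


def route_one_token(logits, n_group, topk_group, top_k, scale=1.0):
    """Illustrative group-limited top-k routing for ONE token (hypothetical helper)."""
    n_experts = len(logits)
    # Fast-fail invariants, mirroring the checks proposed in the review.
    if n_group <= 0 or n_experts % n_group != 0:
        raise ValueError(f"n_experts={n_experts} not divisible by n_group={n_group}")
    if topk_group > n_group:
        raise ValueError(f"topk_group={topk_group} must be <= n_group={n_group}")
    group_size = n_experts // n_group

    # 1) Softmax over all experts (shifted by the max for numerical stability).
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 2) Score each group by the sum of its two best experts; keep the best groups.
    groups = [probs[g * group_size:(g + 1) * group_size] for g in range(n_group)]
    group_scores = [sum(sorted(g, reverse=True)[:2]) for g in groups]
    kept = sorted(range(n_group), key=lambda g: group_scores[g], reverse=True)[:topk_group]

    # 3) Zero out experts in dropped groups, then pick the top-k experts.
    masked = [p if (i // group_size) in kept else 0.0 for i, p in enumerate(probs)]
    chosen = sorted(range(n_experts), key=lambda i: masked[i], reverse=True)[:top_k]

    # 4) Renormalise the chosen weights, then apply the scaling factor.
    norm = sum(masked[i] for i in chosen)
    weights = [scale * masked[i] / norm for i in chosen]
    return chosen, weights
```

For example, with logits `[3.0, 0.0, 2.5, 2.5]`, two groups, and `topk_group=1`, expert 0 has the highest probability but its group loses the group-level comparison, so experts 2 and 3 are selected with equal weight.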

examples/mistral4/fft-vision.yml (1)

20-20: Use a model/example-specific output_dir to avoid accidental run collisions.

Using the same ./outputs/out across the new mistral4 example files can overwrite checkpoints/logs when users try multiple configs back-to-back.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@examples/mistral4/fft-vision.yml` at line 20, The config uses a generic
output_dir value ("./outputs/out") which can cause run collisions; update the
output_dir key in fft-vision.yml to a model/example-specific path (e.g. include
the model and example name or an experiment variable such as
"./outputs/mistral4_fft-vision" or "./outputs/${EXPERIMENT_NAME}") so each
example run writes to a unique directory.
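
One way to get a per-example directory such as the suggested "./outputs/mistral4_fft-vision" is to derive it from the config path. The helper below is a hypothetical sketch for illustration only; axolotl itself reads `output_dir` directly from the YAML.

```python
from pathlib import PurePosixPath


def unique_output_dir(config_path: str, root: str = "./outputs") -> str:
    """Build '<root>/<example folder>_<config name>' from a config path (hypothetical helper)."""
    path = PurePosixPath(config_path)
    # parent.name is the example folder (e.g. 'mistral4'); stem drops the '.yml' suffix.
    return f"{root}/{path.parent.name}_{path.stem}"


print(unique_output_dir("examples/mistral4/fft-vision.yml"))  # ./outputs/mistral4_fft-vision
print(unique_output_dir("examples/mistral4/qlora-text.yml"))  # ./outputs/mistral4_qlora-text
```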

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/mistral4/fft-text.yml`:
- Around line 1-59: The FSDP setting references transformer_layer_cls_to_wrap:
Mistral4DecoderLayer which does not exist; update this to the actual transformer
layer class used by the Leanstral model (verify in the model implementation for
the correct class name, e.g., MistralDecoderLayer or the Leanstral-specific
layer) or remove the transformer_layer_cls_to_wrap entry so FSDP auto-detection
is used; ensure transformer_layer_cls_to_wrap is set to the exact class symbol
exported by the model code that matches base_model: mistralai/Leanstral-2603-HF.

In `@src/axolotl/common/architectures.py`:
- Line 19: The mapping entry with key "mistral4" and value "Mistral4MoE" is
invalid because transformers has no Mistral4MoE class; update the mapping in the
architectures dictionary by either renaming the key to the existing model type
("mistral") or replacing the value with an actual transformers class name (e.g.,
"MistralForCausalLM"), so get_module_class_from_name can resolve the class;
locate the entry referencing "mistral4" and "Mistral4MoE" and change it to a
supported pair (for example "mistral": "MistralForCausalLM") or remove the
unsupported mapping.

In `@src/axolotl/integrations/kernels/constants.py`:
- Around line 28-29: The mapping entry "mistral4": "Mistral4MoE" in the
model->class map is unresolved and will cause resolve_moe_block_classes to
attempt importing transformers.models.mistral4.modeling_mistral4 and raise
ModuleNotFoundError; either remove the "mistral4" entry from the mapping or
replace it with a supported model/class already present in transformers, or if
this is a custom implementation ensure the corresponding module and class are
added to the codebase and importable as modeling_mistral4 with class Mistral4MoE
so resolve_moe_block_classes can import it successfully.

---

Nitpick comments:
In `@examples/mistral4/fft-vision.yml`:
- Line 20: The config uses a generic output_dir value ("./outputs/out") which
can cause run collisions; update the output_dir key in fft-vision.yml to a
model/example-specific path (e.g. include the model and example name or an
experiment variable such as "./outputs/mistral4_fft-vision" or
"./outputs/${EXPERIMENT_NAME}") so each example run writes to a unique
directory.

In `@examples/mistral4/README.md`:
- Line 69: Replace the non-descriptive link text "here" in the README sentence
about the text dataset format with a clear, descriptive label such as "Axolotl
OpenAI Messages format documentation" or "OpenAI Messages format (Axolotl)" so
the link reads like: "The text dataset format follows the OpenAI Messages format
as seen in the Axolotl documentation." Update the anchor text accordingly in
examples/mistral4/README.md to improve accessibility and clarity.

In `@src/axolotl/integrations/kernels/sonicmoe/routing.py`:
- Around line 137-160: Add explicit fast-fail checks before the group-routing
reshape and topk to validate MoE metadata and avoid cryptic tensor errors:
assert or raise a clear ValueError if n_group <= 0, if E % n_group != 0, or if
getattr(moe_block, "topk_group", 1) > n_group; also verify hidden_states.dim()
== 2 and that T matches hidden_states.shape[0] if you keep T. Replace the unused
unpack "T, H = hidden_states.shape" with "T, _ = hidden_states.shape" (or use
hidden_states.size(0) for T), and then perform the checks prior to
scores_for_choice.view(...) and torch.topk so failures show meaningful messages
referencing n_group, E, and moe_block.topk_group.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c7fddb8c-105c-4304-9c7b-844ad0a73364

📥 Commits

Reviewing files that changed from the base of the PR and between 7da5f94 and ca89962.

📒 Files selected for processing (18)

examples/colab-notebooks/colab-axolotl-example.ipynb
examples/mistral4/README.md
examples/mistral4/fft-text.yml
examples/mistral4/fft-vision.yml
examples/mistral4/qlora-text.yml
examples/mistral4/qlora-vision.yml
requirements.txt
scripts/cutcrossentropy_install.py
src/axolotl/common/architectures.py
src/axolotl/integrations/cut_cross_entropy/README.md
src/axolotl/integrations/cut_cross_entropy/__init__.py
src/axolotl/integrations/kernels/constants.py
src/axolotl/integrations/kernels/plugin.py
src/axolotl/integrations/kernels/sonicmoe/routing.py
src/axolotl/loaders/model.py
src/axolotl/loaders/processor.py
src/axolotl/monkeypatch/multipack.py
src/axolotl/utils/config/__init__.py

coderabbitai · 2026-03-16T19:41:43Z

+base_model: mistralai/Leanstral-2603-HF
+
+plugins:
+  - axolotl.integrations.kernels.KernelsPlugin
+use_kernels: true
+use_sonicmoe: true
+
+# only train language model layers, freeze vision tower
+unfrozen_parameters:
+  - model.language_model.*
+  - lm_head
+  - embed_tokens
+
+datasets:
+  - path: fozziethebeat/alpaca_messages_2k_test
+    type: chat_template
+
+dataset_prepared_path: last_run_prepared
+val_set_size: 0.01
+output_dir: ./outputs/out
+
+sequence_len: 2048
+sample_packing: true
+eval_sample_packing: true
+pad_to_sequence_len: true
+
+wandb_project:
+wandb_entity:
+wandb_watch:
+wandb_name:
+wandb_log_model:
+
+gradient_accumulation_steps: 1
+micro_batch_size: 1
+num_epochs: 1
+optimizer: adamw_bnb_8bit
+lr_scheduler: cosine
+learning_rate: 2e-5
+
+bf16: true
+tf32: true
+
+gradient_checkpointing: true
+logging_steps: 1
+flash_attention: true
+
+warmup_ratio: 0.1
+evals_per_epoch: 1
+saves_per_epoch: 1
+weight_decay: 0.0
+
+fsdp_version: 2
+fsdp_config:
+  offload_params: false
+  cpu_ram_efficient_loading: false
+  state_dict_type: FULL_STATE_DICT
+  auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  transformer_layer_cls_to_wrap: Mistral4DecoderLayer
+  reshard_after_forward: true


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify Mistral4DecoderLayer exists for FSDP wrapping
python -c "
try:
    from transformers.models.mistral4.modeling_mistral4 import Mistral4DecoderLayer
    print('SUCCESS: Mistral4DecoderLayer found')
except ImportError as e:
    print(f'ERROR: {e}')
"

Repository: axolotl-ai-cloud/axolotl

Length of output: 105

🏁 Script executed:

find . -type f -name "*.yml" | head -20 | xargs ls -la

Repository: axolotl-ai-cloud/axolotl

Length of output: 1696

🏁 Script executed:

fd "mistral" --type f | head -30

Repository: axolotl-ai-cloud/axolotl

Length of output: 886

🏁 Script executed:

rg "Mistral4" --type yml --type py --type txt -i

Repository: axolotl-ai-cloud/axolotl

Length of output: 95

🏁 Script executed:

rg "mistralai/Leanstral" --type yml --type py --type txt -i

Repository: axolotl-ai-cloud/axolotl

Length of output: 95

🏁 Script executed:

fd "\.yml$" examples/ | grep -i mistral

Repository: axolotl-ai-cloud/axolotl

Length of output: 675

🏁 Script executed:

rg "Mistral4" -t py

Repository: axolotl-ai-cloud/axolotl

Length of output: 353

🏁 Script executed:

cat examples/mistral4/qlora-text.yml

Repository: axolotl-ai-cloud/axolotl

Length of output: 1037

🏁 Script executed:

rg "KernelsPlugin" -t py

Repository: axolotl-ai-cloud/axolotl

Length of output: 288

🏁 Script executed:

rg "Leanstral" -r .

Repository: axolotl-ai-cloud/axolotl

Length of output: 716

🏁 Script executed:

rg "Mistral4DecoderLayer" -t py

Repository: axolotl-ai-cloud/axolotl

Length of output: 50

🏁 Script executed:

rg "DecoderLayer" -t py | grep -i mistral4

Repository: axolotl-ai-cloud/axolotl

Length of output: 50

🏁 Script executed:

rg "transformer_layer_cls_to_wrap" examples/

Repository: axolotl-ai-cloud/axolotl

Length of output: 4672

🏁 Script executed:

cat examples/mistral4/README.md

Repository: axolotl-ai-cloud/axolotl

Length of output: 2706

Mistral4DecoderLayer class does not exist in the codebase and should be verified before using this configuration.

The FSDP configuration references transformer_layer_cls_to_wrap: Mistral4DecoderLayer, but this class is not found in the codebase. While the pattern matches other supported models (e.g., MistralDecoderLayer, Qwen2DecoderLayer), the corresponding class for Mistral4 is missing. The configuration will fail at runtime when FSDP attempts to wrap layers.

Note: The README explicitly states these configs are experimental and require hyperparameter adjustments. Verify the correct transformer layer class name from the Leanstral model implementation before running this configuration.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/mistral4/fft-text.yml` around lines 1 - 59, The FSDP setting references transformer_layer_cls_to_wrap: Mistral4DecoderLayer which does not exist; update this to the actual transformer layer class used by the Leanstral model (verify in the model implementation for the correct class name, e.g., MistralDecoderLayer or the Leanstral-specific layer) or remove the transformer_layer_cls_to_wrap entry so FSDP auto-detection is used; ensure transformer_layer_cls_to_wrap is set to the exact class symbol exported by the model code that matches base_model: mistralai/Leanstral-2603-HF.
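
Because the PR description says Mistral 4 support requires a specific transformers branch (or main), whether `Mistral4DecoderLayer` is importable depends on the installed transformers version. The import check the bot ran can be wrapped in a small helper; this is a minimal sketch, and `class_available` is a hypothetical name rather than anything in axolotl or transformers.

```python
import importlib


def class_available(module_path: str, class_name: str) -> bool:
    """Return True if `class_name` can be found in `module_path` (hypothetical helper)."""
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        # Also covers ModuleNotFoundError: the package is missing or too old.
        return False
    return hasattr(module, class_name)
```

Calling `class_available("transformers.models.mistral4.modeling_mistral4", "Mistral4DecoderLayer")` in the training environment reports whether the installed transformers build ships the class that the FSDP config names.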

coderabbitai · 2026-03-16T19:41:44Z

    "qwen3_vl_moe": "Qwen3VLMoeTextSparseMoeBlock",
    "deepseek_v2": "DeepseekV2MoE",
    "deepseek_v3": "DeepseekV3MoE",
+    "mistral4": "Mistral4MoE",


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for Mistral4MoE class definition in transformers
python -c "
try:
    from transformers.models.mistral4.modeling_mistral4 import Mistral4MoE
    print('SUCCESS: Mistral4MoE class found')
except ImportError as e:
    print(f'ERROR: {e}')
except Exception as e:
    print(f'ERROR: {e}')
"

Repository: axolotl-ai-cloud/axolotl

Length of output: 105

🏁 Script executed:

#!/bin/bash
# First, let's look at the full context of the MOE_ARCH_BLOCK mapping
cat -n src/axolotl/common/architectures.py | head -50

Repository: axolotl-ai-cloud/axolotl

Length of output: 1004

🏁 Script executed:

#!/bin/bash
# Search for references to MOE_ARCH_BLOCK usage
rg "MOE_ARCH_BLOCK" --type py -B 2 -A 2

Repository: axolotl-ai-cloud/axolotl

Length of output: 1115

🏁 Script executed:

#!/bin/bash
# Search for any references to Mistral4MoE in the codebase
rg "Mistral4MoE" --type py

Repository: axolotl-ai-cloud/axolotl

Length of output: 211

🏁 Script executed:

#!/bin/bash
# Search for mistral4 model references
rg "mistral4" --type py -i

Repository: axolotl-ai-cloud/axolotl

Length of output: 753

🏁 Script executed:

#!/bin/bash
# Check what other models are in similar architecture mappings
rg "MoE\|Moe" src/axolotl/common/architectures.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 50

🏁 Script executed:

#!/bin/bash
# Look for how transformers models are imported/loaded
rg "from transformers" src/axolotl/loaders/model.py | head -20

Repository: axolotl-ai-cloud/axolotl

Length of output: 327

🏁 Script executed:

#!/bin/bash
# Check if there's any documentation or requirements about mistral4
fd "requirements" --type f | xargs cat 2>/dev/null | grep -i mistral

Repository: axolotl-ai-cloud/axolotl

Length of output: 90

🏁 Script executed:

#!/bin/bash
# Look for transformers version or mistral-related dependencies
cat setup.py 2>/dev/null || cat pyproject.toml 2>/dev/null | head -100

Repository: axolotl-ai-cloud/axolotl

Length of output: 8078

🌐 Web query:

mistral4 model transformers library

💡 Result:

There isn’t a dedicated “Mistral4” architecture/class in the Hugging Face transformers library; official support is for Mistral (and related families like Mistral3/Mixtral) and you typically load checkpoints via the Auto* APIs. [1]

If by “mistral4” you mean a community Hugging Face repo named .../mistral4 (e.g., kouki13/mistral4), it’s just a model name on the Hub and will generally still load as a Mistral-compatible causal LM via AutoModelForCausalLM. [4]

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "kouki13/mistral4"  # or your HF repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

transformers added first-party Mistral support starting with v4.34.0 (and it has been maintained since). [2]

If you meant Mistral AI’s hosted “latest” models (API models like “Mistral Large/Medium/Small/…”, etc.), those are listed in Mistral’s model overview docs, but they’re not a single “mistral4” transformers model class. [3]

Sources: [1] [2] [3] [4]

Mistral4MoE class does not exist in the transformers library.

The transformers library does not include a mistral4 architecture or Mistral4MoE class (only the standard mistral architecture is supported). This mapping will fail at runtime when get_module_class_from_name attempts to resolve the class from the loaded model's hierarchy for DeepSpeed Zero3 leaf module configuration. Either the model type or class name needs to be corrected to match an existing transformers architecture.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/axolotl/common/architectures.py` at line 19, The mapping entry with key "mistral4" and value "Mistral4MoE" is invalid because transformers has no Mistral4MoE class; update the mapping in the architectures dictionary by either renaming the key to the existing model type ("mistral") or replacing the value with an actual transformers class name (e.g., "MistralForCausalLM"), so get_module_class_from_name can resolve the class; locate the entry referencing "mistral4" and "Mistral4MoE" and change it to a supported pair (for example "mistral": "MistralForCausalLM") or remove the unsupported mapping.
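
The comment above describes `get_module_class_from_name` as resolving a class by name from the loaded model's hierarchy. Assuming that description is accurate, the lookup amounts to a depth-first search over submodules. The sketch below is a simplified stand-in using toy classes, not the library's code; `ToyModule`, `ToyMoE`, `ToyDecoderLayer`, `ToyModel`, and `find_class_by_name` are all hypothetical names.

```python
class ToyModule:
    """Minimal stand-in for a torch.nn.Module: it only exposes children()."""

    def __init__(self, *children):
        self._children = list(children)

    def children(self):
        return iter(self._children)


class ToyMoE(ToyModule):
    pass


class ToyDecoderLayer(ToyModule):
    pass


class ToyModel(ToyModule):
    pass


def find_class_by_name(module, name):
    """Depth-first search for a submodule whose class name equals `name`."""
    if type(module).__name__ == name:
        return type(module)
    for child in module.children():
        found = find_class_by_name(child, name)
        if found is not None:
            return found
    return None


# A toy model with two decoder layers, each holding one MoE block.
model = ToyModel(ToyDecoderLayer(ToyMoE()), ToyDecoderLayer(ToyMoE()))
print(find_class_by_name(model, "ToyMoE"))       # the ToyMoE class
print(find_class_by_name(model, "Mistral4MoE"))  # None: no such submodule in the toy model
```

In this simplified picture the result depends only on which classes appear in the instantiated model, so a name in the mapping resolves exactly when the loaded model contains a submodule of that class.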

coderabbitai · 2026-03-16T19:41:44Z

+    # softmax -> topk routing (with group-based expert selection)
+    "mistral4": "Mistral4MoE",


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the module path and class exist
python -c "
import importlib
try:
    module = importlib.import_module('transformers.models.mistral4.modeling_mistral4')
    cls = getattr(module, 'Mistral4MoE', None)
    if cls:
        print('SUCCESS: Mistral4MoE found in transformers.models.mistral4.modeling_mistral4')
    else:
        print('ERROR: Mistral4MoE class not found in module')
except ModuleNotFoundError as e:
    print(f'ERROR: Module not found - {e}')
"

Repository: axolotl-ai-cloud/axolotl

Length of output: 124

🏁 Script executed:

# First, let's read the constants.py file to understand the full context
cat -n src/axolotl/integrations/kernels/constants.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 3407

🏁 Script executed:

# Search for any existing mistral4 references in the codebase
rg "mistral4" --no-heading

Repository: axolotl-ai-cloud/axolotl

Length of output: 1036

🏁 Script executed:

# Check what other models are in the SPARSE_MOE_BLOCK dictionary to understand the pattern
rg "SPARSE_MOE_BLOCK" -A 20

Repository: axolotl-ai-cloud/axolotl

Length of output: 3467

🌐 Web query:

transformers library mistral4 modeling class Mistral4MoE

💡 Result:

transformers does not currently include any modeling class named Mistral4MoE (and there’s no official “Mistral4” modeling module in the library).

If you meant Mistral’s MoE model “Mixtral”, the corresponding Hugging Face Transformers modeling classes live under transformers.models.mixtral, notably:

MixtralModel

MixtralForCausalLM (plus seq/token classification + QA variants) [1]

Example import:

from transformers import MixtralForCausalLM
# or: from transformers.models.mixtral.modeling_mixtral import MixtralForCausalLM

[1][2]

Sources: [1][2]

Remove mistral4 entry or verify transformers support exists.

The mapping "mistral4": "Mistral4MoE" cannot be resolved. The transformers library does not include a mistral4 module or Mistral4MoE class. The resolve_moe_block_classes function will fail with ModuleNotFoundError when attempting to import transformers.models.mistral4.modeling_mistral4. Confirm whether this is a custom addition requiring external implementation or if a different model type should be used instead.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/integrations/kernels/constants.py` around lines 28 - 29, The
mapping entry "mistral4": "Mistral4MoE" in the model->class map is unresolved
and will cause resolve_moe_block_classes to attempt importing
transformers.models.mistral4.modeling_mistral4 and raise ModuleNotFoundError;
either remove the "mistral4" entry from the mapping or replace it with a
supported model/class already present in transformers, or if this is a custom
implementation ensure the corresponding module and class are added to the
codebase and importable as modeling_mistral4 with class Mistral4MoE so
resolve_moe_block_classes can import it successfully.

codecov · 2026-03-16T21:59:09Z

Codecov Report

❌ Patch coverage is 12.19512% with 36 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...c/axolotl/integrations/kernels/sonicmoe/routing.py	3.33%	29 Missing ⚠️
src/axolotl/integrations/kernels/plugin.py	0.00%	3 Missing ⚠️
src/axolotl/loaders/model.py	0.00%	3 Missing ⚠️
src/axolotl/utils/config/__init__.py	75.00%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

NanoCode012 added 8 commits March 16, 2026 14:13

feat: add mistral small 4

e5ee131

fix: update mistral common

f4ca7ae

fix: deepcopy when passing in tokenizer

83f7766

feat: add doc on reasoning and thinking section

2fd3134

fix: don't use custom tokenizer and quantize experts

eab7eb8

chore: update docs and configs

4390546

chore: update doc to follow official name

1b069a2

feat: update cce to include mistral4

ca89962

chore: move

8fa98fb

coderabbitai Bot reviewed Mar 16, 2026

View reviewed changes

NanoCode012 added 2 commits March 17, 2026 08:22

fix: naming

f789c86

fix: test mock breaking get_text_config check

d008c95

NanoCode012 changed the title from "feat: add Leanstral (mistral 4)" to "feat: add Mistral Small 4" Mar 17, 2026

NanoCode012 added 6 commits March 17, 2026 09:08

fix: enable CCE and add expert block targetting to configs

f4f1b6a

chore: docs

a2703a6

fix: use act checkpointing

5ec5aa4

chore: doc

b13c4fc

chore: docs

33972f1

chore: docs

07c3c5b

NanoCode012 merged commit a098df5 into main Mar 17, 2026
12 of 15 checks passed

NanoCode012 deleted the feat/mistral4 branch March 17, 2026 02:39

This was referenced Mar 18, 2026

super nemo support #3508

Merged

feat: add custom routing support for ernie4_5_moe, and hunyuan_v1_moe #3526

Merged

NanoCode012 merged 17 commits into main from feat/mistral4
