
feat: add Mistral Small 4#3502

Merged
NanoCode012 merged 17 commits into
mainfrom
feat/mistral4
Mar 17, 2026
Conversation

@NanoCode012

@NanoCode012 NanoCode012 commented Mar 16, 2026

Collaborator

Description

https://mistral.ai/news/mistral-small-4

The FFT (full fine-tuning) config is untested due to lack of time. QLoRA is tested.

Requires the Transformers changes from huggingface/transformers#44760 (now also on main).

Blocker:

  • We converted the Mistral weights to BF16 for training. The Hugging Face checkpoint is currently FP8 only.

Motivation and Context

How has this been tested?

AI Usage Disclaimer

Claude helped run experiments.

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

@coderabbitai

coderabbitai Bot commented Mar 16, 2026

Contributor

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c63ee948-14fe-4950-bbc9-5809ac359a33

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds comprehensive support for Mistral4 model training, including new configuration examples for FFT and QLoRA approaches with text and vision datasets. It updates model architecture registries, implements MoE routing logic, updates dependencies, and enhances configuration normalization and processor initialization.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Mistral4 Documentation & Configuration Examples: `examples/mistral4/README.md`, `examples/mistral4/fft-text.yml`, `examples/mistral4/fft-vision.yml`, `examples/mistral4/qlora-text.yml`, `examples/mistral4/qlora-vision.yml` | Adds comprehensive documentation and four training configuration files for Mistral4 model fine-tuning using FFT (full fine-tuning) and QLoRA (quantized LoRA) approaches with both text and vision datasets, including hyperparameters, LoRA targets, and FSDP configuration. |
| Dependency & Installation Updates: `requirements.txt`, `scripts/cutcrossentropy_install.py`, `examples/colab-notebooks/colab-axolotl-example.ipynb`, `src/axolotl/integrations/cut_cross_entropy/...` | Updates mistral-common from 1.8.8 to 1.10.0 and changes ml-cross-entropy commit hash from e8ad129 to fa9a7fe across Colab notebook, installation script, README documentation, and integration initialization files. Extends supported models list to include mistral4 and nemotron_h. |
| Model Architecture Registration: `src/axolotl/common/architectures.py`, `src/axolotl/integrations/kernels/constants.py`, `src/axolotl/monkeypatch/multipack.py` | Registers "mistral4" model type in MOE_ARCH_BLOCK, SPARSE_MOE_BLOCK mappings, and SUPPORTED_MULTIPACK_MODEL_TYPES list to enable MoE architecture support and multipack optimization. |
| MoE Routing Implementation: `src/axolotl/integrations/kernels/sonicmoe/routing.py` | Introduces softmax_group_topk_routing function implementing group-based top-k expert selection with renormalization and scaling for Mistral4. Extends model configuration mapping to wire mistral4 to new routing strategy with SwiGLU activation. |
| Configuration & Model Loading: `src/axolotl/utils/config/__init__.py`, `src/axolotl/loaders/model.py`, `src/axolotl/loaders/processor.py`, `src/axolotl/integrations/kernels/plugin.py` | Adds VLM text backbone type resolution for model_config_type_text, refactors MOE block lookup to prefer resolved text type, changes processor tokenizer assignment from parameter injection to post-instantiation, and updates MoE kernelization logging to use resolved model type. |
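
The routing order named in the walkthrough (softmax, group selection, top-k, renormalization, scaling) can be illustrated for a single token in plain Python. This is a minimal sketch under stated assumptions, not the code in routing.py: the helper name `route_one_token`, the `scale` parameter, and the use of lists instead of tensors are hypothetical, and scoring each group by the sum of its two highest probabilities mirrors the `topk(2)` step visible in the review diff in this thread.

```python
import math


def route_one_token(logits, n_group, topk_group, top_k, scale=1.0):
    """Illustrative group top-k routing for one token (hypothetical helper)."""
    n_experts = len(logits)
    if n_experts % n_group != 0:
        raise ValueError("number of experts must be divisible by n_group")
    group_size = n_experts // n_group

    # 1. Softmax over all expert logits (shifted by the max for stability).
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 2. Score each group by the sum of its two best experts, keep the top groups.
    group_scores = []
    for g in range(n_group):
        members = sorted(probs[g * group_size:(g + 1) * group_size], reverse=True)
        group_scores.append(sum(members[:2]))
    kept = sorted(range(n_group), key=lambda g: group_scores[g], reverse=True)[:topk_group]

    # 3. Zero out experts in dropped groups, then pick the top-k experts.
    masked = [p if (i // group_size) in kept else 0.0 for i, p in enumerate(probs)]
    experts = sorted(range(n_experts), key=lambda i: masked[i], reverse=True)[:top_k]

    # 4. Renormalize the selected weights to sum to 1, then apply the scale.
    selected_sum = sum(masked[i] for i in experts)
    weights = [scale * masked[i] / selected_sum for i in experts]
    return experts, weights


# Two groups of four experts; only the second group survives group selection.
experts, weights = route_one_token(
    logits=[2.0, 1.0, 0.0, 0.0, 3.0, 0.5, 0.0, 0.0],
    n_group=2, topk_group=1, top_k=2,
)
```

With these inputs the second group has the larger two-expert score, so both selected experts come from it and their weights sum to `scale`.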

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Possibly related PRs

Suggested labels

under review

Suggested reviewers

  • winglian
  • djsaunde
  • SalmanMohammadi
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 44.44% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'feat: add Mistral Small 4' directly corresponds to the main objective of adding Mistral 4 (Leanstral) support throughout the codebase, as evidenced by new configurations, routing logic, and architecture mappings for mistral4. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@github-actions

github-actions Bot commented Mar 16, 2026

Contributor

📖 Documentation Preview: https://69b8c047bc53a126672293e2--resonant-treacle-0fd729.netlify.app

Deployed on Netlify from commit 07c3c5b

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (3)
examples/mistral4/README.md (1)

69-69: Consider using more descriptive link text.

The link text "here" is flagged by static analysis as non-descriptive. More descriptive link text improves accessibility and helps users understand what they'll find before clicking.

📝 Suggested improvement
-- The text dataset format follows the OpenAI Messages format as seen [here](https://docs.axolotl.ai/docs/dataset-formats/conversation.html#chat_template).
+- The text dataset format follows the OpenAI Messages format as documented in the [chat_template dataset format guide](https://docs.axolotl.ai/docs/dataset-formats/conversation.html#chat_template).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/mistral4/README.md` at line 69, Replace the non-descriptive link
text "here" in the README sentence about the text dataset format with a clear,
descriptive label such as "Axolotl OpenAI Messages format documentation" or
"OpenAI Messages format (Axolotl)" so the link reads like: "The text dataset
format follows the OpenAI Messages format as seen in the Axolotl documentation."
Update the anchor text accordingly in examples/mistral4/README.md to improve
accessibility and clarity.
src/axolotl/integrations/kernels/sonicmoe/routing.py (1)

137-160: Add explicit group-routing invariants before view/topk (and clean up unused unpack).

This path currently assumes valid MoE metadata; invalid values will fail deep in tensor ops with hard-to-debug errors. A fast-fail check makes failures deterministic and clearer.

Proposed hardening patch
 def softmax_group_topk_routing(
     hidden_states: torch.Tensor, moe_block
 ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
     """Mistral4-style routing: softmax -> group selection -> topk -> renorm -> scale."""
     gate = moe_block.gate
-    T, H = hidden_states.shape
+    T, _ = hidden_states.shape
     K = moe_block.top_k
     E = getattr(moe_block, "n_routed_experts", gate.weight.shape[0])
     n_group = getattr(moe_block, "n_group", 1)
+    if E % n_group != 0:
+        raise ValueError(f"Invalid routing layout: n_routed_experts={E} not divisible by n_group={n_group}")
+    group_size = E // n_group
+    topk_group = getattr(moe_block, "topk_group", n_group)
+    if topk_group > n_group:
+        raise ValueError(f"Invalid topk_group={topk_group}; must be <= n_group={n_group}")
+    if n_group > 1 and group_size < 2:
+        raise ValueError(f"Invalid group_size={group_size}; group routing requires at least 2 experts per group")

     router_logits = F.linear(hidden_states, gate.weight)  # [T, E]
     router_probs = F.softmax(router_logits, dim=-1, dtype=torch.float32)  # [T, E]

     scores_for_choice = router_probs

     # Group selection: pick top groups, mask the rest
     if n_group > 1:
         group_scores = (
-            scores_for_choice.view(-1, n_group, E // n_group)
+            scores_for_choice.view(-1, n_group, group_size)
             .topk(2, dim=-1)[0]
             .sum(dim=-1)
         )
         group_idx = torch.topk(
-            group_scores, k=moe_block.topk_group, dim=-1, sorted=False
+            group_scores, k=topk_group, dim=-1, sorted=False
         )[1]
         group_mask = torch.zeros_like(group_scores)
         group_mask.scatter_(1, group_idx, 1)
         score_mask = (
-            group_mask.unsqueeze(-1).expand(-1, n_group, E // n_group).reshape(-1, E)
+            group_mask.unsqueeze(-1).expand(-1, n_group, group_size).reshape(-1, E)
         )
         scores_for_choice = scores_for_choice.masked_fill(~score_mask.bool(), 0.0)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/integrations/kernels/sonicmoe/routing.py` around lines 137 - 160,
Add explicit fast-fail checks before the group-routing reshape and topk to
validate MoE metadata and avoid cryptic tensor errors: assert or raise a clear
ValueError if n_group <= 0, if E % n_group != 0, or if getattr(moe_block,
"topk_group", 1) > n_group; also verify hidden_states.dim() == 2 and that T
matches hidden_states.shape[0] if you keep T. Replace the unused unpack "T, H =
hidden_states.shape" with "T, _ = hidden_states.shape" (or use
hidden_states.size(0) for T), and then perform the checks prior to
scores_for_choice.view(...) and torch.topk so failures show meaningful messages
referencing n_group, E, and moe_block.topk_group.
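
The fast-fail checks requested above can be sketched as a standalone validator. This is a hedged illustration, not the merged code: the function name `check_group_routing` is hypothetical, and it covers only the three metadata conditions named in the comment (positive `n_group`, divisibility of the expert count, and `topk_group` not exceeding `n_group`).

```python
def check_group_routing(n_experts: int, n_group: int, topk_group: int) -> int:
    """Validate group-routing metadata and return the experts-per-group count.

    Hypothetical helper: raises ValueError with a readable message instead of
    letting a later tensor reshape or top-k call fail with a cryptic error.
    """
    if n_group <= 0:
        raise ValueError(f"n_group must be positive, got {n_group}")
    if n_experts % n_group != 0:
        raise ValueError(
            f"n_routed_experts={n_experts} not divisible by n_group={n_group}"
        )
    if not 0 < topk_group <= n_group:
        raise ValueError(f"topk_group={topk_group} must be in 1..{n_group}")
    return n_experts // n_group


group_size = check_group_routing(n_experts=8, n_group=2, topk_group=1)
```

Calling the validator once before the reshape keeps the error message close to the misconfigured values rather than deep inside tensor code.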
examples/mistral4/fft-vision.yml (1)

20-20: Use a model/example-specific output_dir to avoid accidental run collisions.

Using the same ./outputs/out across the new mistral4 example files can overwrite checkpoints/logs when users try multiple configs back-to-back.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/mistral4/fft-vision.yml` at line 20, The config uses a generic
output_dir value ("./outputs/out") which can cause run collisions; update the
output_dir key in fft-vision.yml to a model/example-specific path (e.g. include
the model and example name or an experiment variable such as
"./outputs/mistral4_fft-vision" or "./outputs/${EXPERIMENT_NAME}") so each
example run writes to a unique directory.
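
One way to derive a per-example directory is to build it from the config file's own location. The snippet below is a small sketch of that idea only; the helper name `example_output_dir` and the `<folder>_<file stem>` naming scheme are assumptions, not an Axolotl feature.

```python
from pathlib import Path


def example_output_dir(config_path: str, root: str = "./outputs") -> str:
    """Build an output directory from the config's parent folder and file stem.

    Hypothetical helper: "examples/mistral4/fft-vision.yml" maps to a directory
    named after both the model folder and the config file.
    """
    cfg = Path(config_path)
    return (Path(root) / f"{cfg.parent.name}_{cfg.stem}").as_posix()


vision_dir = example_output_dir("examples/mistral4/fft-vision.yml")
text_dir = example_output_dir("examples/mistral4/fft-text.yml")
```

Because the two configs differ in file stem, back-to-back runs would write to different directories.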
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/mistral4/fft-text.yml`:
- Around line 1-59: The FSDP setting references transformer_layer_cls_to_wrap:
Mistral4DecoderLayer which does not exist; update this to the actual transformer
layer class used by the Leanstral model (verify in the model implementation for
the correct class name, e.g., MistralDecoderLayer or the Leanstral-specific
layer) or remove the transformer_layer_cls_to_wrap entry so FSDP auto-detection
is used; ensure transformer_layer_cls_to_wrap is set to the exact class symbol
exported by the model code that matches base_model: mistralai/Leanstral-2603-HF.

In `@src/axolotl/common/architectures.py`:
- Line 19: The mapping entry with key "mistral4" and value "Mistral4MoE" is
invalid because transformers has no Mistral4MoE class; update the mapping in the
architectures dictionary by either renaming the key to the existing model type
("mistral") or replacing the value with an actual transformers class name (e.g.,
"MistralForCausalLM"), so get_module_class_from_name can resolve the class;
locate the entry referencing "mistral4" and "Mistral4MoE" and change it to a
supported pair (for example "mistral": "MistralForCausalLM") or remove the
unsupported mapping.

In `@src/axolotl/integrations/kernels/constants.py`:
- Around line 28-29: The mapping entry "mistral4": "Mistral4MoE" in the
model->class map is unresolved and will cause resolve_moe_block_classes to
attempt importing transformers.models.mistral4.modeling_mistral4 and raise
ModuleNotFoundError; either remove the "mistral4" entry from the mapping or
replace it with a supported model/class already present in transformers, or if
this is a custom implementation ensure the corresponding module and class are
added to the codebase and importable as modeling_mistral4 with class Mistral4MoE
so resolve_moe_block_classes can import it successfully.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c7fddb8c-105c-4304-9c7b-844ad0a73364

📥 Commits

Reviewing files that changed from the base of the PR and between 7da5f94 and ca89962.

📒 Files selected for processing (18)
  • examples/colab-notebooks/colab-axolotl-example.ipynb
  • examples/mistral4/README.md
  • examples/mistral4/fft-text.yml
  • examples/mistral4/fft-vision.yml
  • examples/mistral4/qlora-text.yml
  • examples/mistral4/qlora-vision.yml
  • requirements.txt
  • scripts/cutcrossentropy_install.py
  • src/axolotl/common/architectures.py
  • src/axolotl/integrations/cut_cross_entropy/README.md
  • src/axolotl/integrations/cut_cross_entropy/__init__.py
  • src/axolotl/integrations/kernels/constants.py
  • src/axolotl/integrations/kernels/plugin.py
  • src/axolotl/integrations/kernels/sonicmoe/routing.py
  • src/axolotl/loaders/model.py
  • src/axolotl/loaders/processor.py
  • src/axolotl/monkeypatch/multipack.py
  • src/axolotl/utils/config/__init__.py

Comment thread examples/mistral4/fft-text.yml Outdated
Comment on lines +1 to +59
base_model: mistralai/Leanstral-2603-HF

plugins:
- axolotl.integrations.kernels.KernelsPlugin
use_kernels: true
use_sonicmoe: true

# only train language model layers, freeze vision tower
unfrozen_parameters:
- model.language_model.*
- lm_head
- embed_tokens

datasets:
  - path: fozziethebeat/alpaca_messages_2k_test
    type: chat_template

dataset_prepared_path: last_run_prepared
val_set_size: 0.01
output_dir: ./outputs/out

sequence_len: 2048
sample_packing: true
eval_sample_packing: true
pad_to_sequence_len: true

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 2e-5

bf16: true
tf32: true

gradient_checkpointing: true
logging_steps: 1
flash_attention: true

warmup_ratio: 0.1
evals_per_epoch: 1
saves_per_epoch: 1
weight_decay: 0.0

fsdp_version: 2
fsdp_config:
  offload_params: false
  cpu_ram_efficient_loading: false
  state_dict_type: FULL_STATE_DICT
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Mistral4DecoderLayer
  reshard_after_forward: true


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify Mistral4DecoderLayer exists for FSDP wrapping
python -c "
try:
    from transformers.models.mistral4.modeling_mistral4 import Mistral4DecoderLayer
    print('SUCCESS: Mistral4DecoderLayer found')
except ImportError as e:
    print(f'ERROR: {e}')
"

Repository: axolotl-ai-cloud/axolotl

Length of output: 105


🏁 Script executed:

find . -type f -name "*.yml" | head -20 | xargs ls -la

Repository: axolotl-ai-cloud/axolotl

Length of output: 1696


🏁 Script executed:

fd "mistral" --type f | head -30

Repository: axolotl-ai-cloud/axolotl

Length of output: 886


🏁 Script executed:

rg "Mistral4" --type yml --type py --type txt -i

Repository: axolotl-ai-cloud/axolotl

Length of output: 95


🏁 Script executed:

rg "mistralai/Leanstral" --type yml --type py --type txt -i

Repository: axolotl-ai-cloud/axolotl

Length of output: 95


🏁 Script executed:

fd "\.yml$" examples/ | grep -i mistral

Repository: axolotl-ai-cloud/axolotl

Length of output: 675


🏁 Script executed:

rg "Mistral4" -t py

Repository: axolotl-ai-cloud/axolotl

Length of output: 353


🏁 Script executed:

cat examples/mistral4/qlora-text.yml

Repository: axolotl-ai-cloud/axolotl

Length of output: 1037


🏁 Script executed:

rg "KernelsPlugin" -t py

Repository: axolotl-ai-cloud/axolotl

Length of output: 288


🏁 Script executed:

rg "Leanstral" -r .

Repository: axolotl-ai-cloud/axolotl

Length of output: 716


🏁 Script executed:

rg "Mistral4DecoderLayer" -t py

Repository: axolotl-ai-cloud/axolotl

Length of output: 50


🏁 Script executed:

rg "DecoderLayer" -t py | grep -i mistral4

Repository: axolotl-ai-cloud/axolotl

Length of output: 50


🏁 Script executed:

rg "transformer_layer_cls_to_wrap" examples/

Repository: axolotl-ai-cloud/axolotl

Length of output: 4672


🏁 Script executed:

cat examples/mistral4/README.md

Repository: axolotl-ai-cloud/axolotl

Length of output: 2706


Mistral4DecoderLayer class does not exist in the codebase and should be verified before using this configuration.

The FSDP configuration references transformer_layer_cls_to_wrap: Mistral4DecoderLayer, but this class is not found in the codebase. While the pattern matches other supported models (e.g., MistralDecoderLayer, Qwen2DecoderLayer), the corresponding class for Mistral4 is missing. The configuration will fail at runtime when FSDP attempts to wrap layers.

Note: The README explicitly states these configs are experimental and require hyperparameter adjustments. Verify the correct transformer layer class name from the Leanstral model implementation before running this configuration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/mistral4/fft-text.yml` around lines 1 - 59, The FSDP setting
references transformer_layer_cls_to_wrap: Mistral4DecoderLayer which does not
exist; update this to the actual transformer layer class used by the Leanstral
model (verify in the model implementation for the correct class name, e.g.,
MistralDecoderLayer or the Leanstral-specific layer) or remove the
transformer_layer_cls_to_wrap entry so FSDP auto-detection is used; ensure
transformer_layer_cls_to_wrap is set to the exact class symbol exported by the
model code that matches base_model: mistralai/Leanstral-2603-HF.
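
The verification step described above can be written as a small reusable check. This is a sketch, not part of the PR: the helper names `class_available` and `modeling_module` are hypothetical, and the module path simply follows the `transformers.models.<type>.modeling_<type>` pattern that the review scripts in this thread import from. The result depends entirely on which Transformers build is installed.

```python
import importlib


def modeling_module(model_type: str) -> str:
    """Module path following the transformers.models.<type>.modeling_<type> pattern."""
    return f"transformers.models.{model_type}.modeling_{model_type}"


def class_available(module_path: str, class_name: str) -> bool:
    """Return True if `module_path` imports and exposes `class_name`.

    ModuleNotFoundError is a subclass of ImportError, so a missing module
    and a failing import are both reported as "not available".
    """
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        return False
    return hasattr(module, class_name)
```

For the config in question, `class_available(modeling_module("mistral4"), "Mistral4DecoderLayer")` would report whether the installed Transformers version provides the layer class before an FSDP run is launched.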

"qwen3_vl_moe": "Qwen3VLMoeTextSparseMoeBlock",
"deepseek_v2": "DeepseekV2MoE",
"deepseek_v3": "DeepseekV3MoE",
"mistral4": "Mistral4MoE",


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for Mistral4MoE class definition in transformers
python -c "
try:
    from transformers.models.mistral4.modeling_mistral4 import Mistral4MoE
    print('SUCCESS: Mistral4MoE class found')
except ImportError as e:
    print(f'ERROR: {e}')
except Exception as e:
    print(f'ERROR: {e}')
"

Repository: axolotl-ai-cloud/axolotl

Length of output: 105


🏁 Script executed:

#!/bin/bash
# First, let's look at the full context of the MOE_ARCH_BLOCK mapping
cat -n src/axolotl/common/architectures.py | head -50

Repository: axolotl-ai-cloud/axolotl

Length of output: 1004


🏁 Script executed:

#!/bin/bash
# Search for references to MOE_ARCH_BLOCK usage
rg "MOE_ARCH_BLOCK" --type py -B 2 -A 2

Repository: axolotl-ai-cloud/axolotl

Length of output: 1115


🏁 Script executed:

#!/bin/bash
# Search for any references to Mistral4MoE in the codebase
rg "Mistral4MoE" --type py

Repository: axolotl-ai-cloud/axolotl

Length of output: 211


🏁 Script executed:

#!/bin/bash
# Search for mistral4 model references
rg "mistral4" --type py -i

Repository: axolotl-ai-cloud/axolotl

Length of output: 753


🏁 Script executed:

#!/bin/bash
# Check what other models are in similar architecture mappings
rg "MoE\|Moe" src/axolotl/common/architectures.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 50


🏁 Script executed:

#!/bin/bash
# Look for how transformers models are imported/loaded
rg "from transformers" src/axolotl/loaders/model.py | head -20

Repository: axolotl-ai-cloud/axolotl

Length of output: 327


🏁 Script executed:

#!/bin/bash
# Check if there's any documentation or requirements about mistral4
fd "requirements" --type f | xargs cat 2>/dev/null | grep -i mistral

Repository: axolotl-ai-cloud/axolotl

Length of output: 90


🏁 Script executed:

#!/bin/bash
# Look for transformers version or mistral-related dependencies
cat setup.py 2>/dev/null || cat pyproject.toml 2>/dev/null | head -100

Repository: axolotl-ai-cloud/axolotl

Length of output: 8078


🌐 Web query:

mistral4 model transformers library

💡 Result:

There isn’t a dedicated “Mistral4” architecture/class in the Hugging Face transformers library; official support is for Mistral (and related families like Mistral3/Mixtral) and you typically load checkpoints via the Auto* APIs. [1]

If by “mistral4” you mean a community Hugging Face repo named .../mistral4 (e.g., kouki13/mistral4), it’s just a model name on the Hub and will generally still load as a Mistral-compatible causal LM via AutoModelForCausalLM. [4]

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "kouki13/mistral4"  # or your HF repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

transformers added first-party Mistral support starting with v4.34.0 (and it has been maintained since). [2]

If you meant Mistral AI’s hosted “latest” models (API models like “Mistral Large/Medium/Small/…”, etc.), those are listed in Mistral’s model overview docs, but they’re not a single “mistral4” transformers model class. [3]

Sources: [1] [2] [3] [4]


Mistral4MoE class does not exist in the transformers library.

The transformers library does not include a mistral4 architecture or Mistral4MoE class (only the standard mistral architecture is supported). This mapping will fail at runtime when get_module_class_from_name attempts to resolve the class from the loaded model's hierarchy for DeepSpeed Zero3 leaf module configuration. Either the model type or class name needs to be corrected to match an existing transformers architecture.
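
For context, the mapping under discussion is keyed by model type, and the walkthrough notes that the lookup now prefers the resolved text-backbone type for vision-language models. A minimal sketch of such a lookup follows; it is an illustration only. The dictionary is an excerpt of the entries quoted in this thread, and the function name `moe_block_name` and the wrapper type `"some_vlm"` are hypothetical.

```python
# Excerpt of the mapping quoted in this thread; the real table has more entries.
MOE_ARCH_BLOCK = {
    "deepseek_v2": "DeepseekV2MoE",
    "deepseek_v3": "DeepseekV3MoE",
    "mistral4": "Mistral4MoE",
}


def moe_block_name(model_type, text_model_type=None):
    """Return the MoE block class name, preferring the text-backbone type.

    Hypothetical helper: returns None when neither type is registered, so the
    caller can skip MoE-specific setup instead of failing on a missing key.
    """
    for key in (text_model_type, model_type):
        if key is not None and key in MOE_ARCH_BLOCK:
            return MOE_ARCH_BLOCK[key]
    return None


# "some_vlm" stands in for a vision-language wrapper type (an assumption).
name = moe_block_name("some_vlm", text_model_type="mistral4")
```

The name returned here is only a string; whether it resolves to a real class still depends on the loaded model exposing a module with that class name.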

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/common/architectures.py` at line 19, The mapping entry with key
"mistral4" and value "Mistral4MoE" is invalid because transformers has no
Mistral4MoE class; update the mapping in the architectures dictionary by either
renaming the key to the existing model type ("mistral") or replacing the value
with an actual transformers class name (e.g., "MistralForCausalLM"), so
get_module_class_from_name can resolve the class; locate the entry referencing
"mistral4" and "Mistral4MoE" and change it to a supported pair (for example
"mistral": "MistralForCausalLM") or remove the unsupported mapping.

Comment on lines +28 to +29
# softmax -> topk routing (with group-based expert selection)
"mistral4": "Mistral4MoE",


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the module path and class exist
python -c "
import importlib
try:
    module = importlib.import_module('transformers.models.mistral4.modeling_mistral4')
    cls = getattr(module, 'Mistral4MoE', None)
    if cls:
        print('SUCCESS: Mistral4MoE found in transformers.models.mistral4.modeling_mistral4')
    else:
        print('ERROR: Mistral4MoE class not found in module')
except ModuleNotFoundError as e:
    print(f'ERROR: Module not found - {e}')
"

Repository: axolotl-ai-cloud/axolotl

Length of output: 124


🏁 Script executed:

# First, let's read the constants.py file to understand the full context
cat -n src/axolotl/integrations/kernels/constants.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 3407


🏁 Script executed:

# Search for any existing mistral4 references in the codebase
rg "mistral4" --no-heading

Repository: axolotl-ai-cloud/axolotl

Length of output: 1036


🏁 Script executed:

# Check what other models are in the SPARSE_MOE_BLOCK dictionary to understand the pattern
rg "SPARSE_MOE_BLOCK" -A 20

Repository: axolotl-ai-cloud/axolotl

Length of output: 3467


🌐 Web query:

transformers library mistral4 modeling class Mistral4MoE

💡 Result:

transformers does not currently include any modeling class named Mistral4MoE (and there’s no official “Mistral4” modeling module in the library).

If you meant Mistral’s MoE model “Mixtral”, the corresponding Hugging Face Transformers modeling classes live under transformers.models.mixtral, notably:

  • MixtralModel
  • MixtralForCausalLM (plus seq/token classification + QA variants) [1]

Example import:

from transformers import MixtralForCausalLM
# or: from transformers.models.mixtral.modeling_mixtral import MixtralForCausalLM
[1][2]

Sources: [1][2]

Remove mistral4 entry or verify transformers support exists.

The mapping "mistral4": "Mistral4MoE" cannot be resolved. The transformers library does not include a mistral4 module or Mistral4MoE class. The resolve_moe_block_classes function will fail with ModuleNotFoundError when attempting to import transformers.models.mistral4.modeling_mistral4. Confirm whether this is a custom addition requiring external implementation or if a different model type should be used instead.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/integrations/kernels/constants.py` around lines 28 - 29, The
mapping entry "mistral4": "Mistral4MoE" in the model->class map is unresolved
and will cause resolve_moe_block_classes to attempt importing
transformers.models.mistral4.modeling_mistral4 and raise ModuleNotFoundError;
either remove the "mistral4" entry from the mapping or replace it with a
supported model/class already present in transformers, or if this is a custom
implementation ensure the corresponding module and class are added to the
codebase and importable as modeling_mistral4 with class Mistral4MoE so
resolve_moe_block_classes can import it successfully.

@codecov

codecov Bot commented Mar 16, 2026


Codecov Report

❌ Patch coverage is 12.19512% with 36 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...c/axolotl/integrations/kernels/sonicmoe/routing.py | 3.33% | 29 Missing ⚠️ |
| src/axolotl/integrations/kernels/plugin.py | 0.00% | 3 Missing ⚠️ |
| src/axolotl/loaders/model.py | 0.00% | 3 Missing ⚠️ |
| src/axolotl/utils/config/__init__.py | 75.00% | 1 Missing ⚠️ |
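
The headline figure can be cross-checked against the per-file rows. The arithmetic below is a worked example based only on the numbers in this report; the total of 41 changed lines is inferred from the percentage, not stated by Codecov, and the short file labels are used purely for readability.

```python
# Missing-line counts per file, as listed in the Codecov report.
missing = {
    "sonicmoe/routing.py": 29,
    "kernels/plugin.py": 3,
    "loaders/model.py": 3,
    "utils/config/__init__.py": 1,
}
total_missing = sum(missing.values())

# Reported patch coverage; missing = total * (1 - coverage),
# so total = missing / (1 - coverage).
patch_coverage = 12.19512 / 100
total_lines = round(total_missing / (1 - patch_coverage))
covered_lines = total_lines - total_missing

print(total_missing, total_lines, covered_lines)
print(f"{covered_lines / total_lines:.5%}")
```

The per-file rows sum to the reported 36 missing lines, and the reported percentage is consistent with 5 covered lines out of an inferred 41.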

📢 Thoughts on this report? Let us know!

@NanoCode012 NanoCode012 changed the title feat: add Leanstral (mistral 4) feat: add Mistral Small 4 Mar 17, 2026
@NanoCode012 NanoCode012 merged commit a098df5 into main Mar 17, 2026
12 of 15 checks passed
@NanoCode012 NanoCode012 deleted the feat/mistral4 branch March 17, 2026 02:39
