feat: add Mistral Small 4 #3502
📝 Walkthrough
This PR adds comprehensive support for Mistral4 model training, including new configuration examples for FFT and QLoRA approaches with text and vision datasets. It updates model architecture registries, implements MoE routing logic, updates dependencies, and enhances configuration normalization and processor initialization.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
📖 Documentation Preview: https://69b8c047bc53a126672293e2--resonant-treacle-0fd729.netlify.app Deployed on Netlify from commit 07c3c5b |
Actionable comments posted: 3
🧹 Nitpick comments (3)
examples/mistral4/README.md (1)
69-69: Consider using more descriptive link text.
The link text "here" is flagged by static analysis as non-descriptive. More descriptive link text improves accessibility and helps users understand what they'll find before clicking.
📝 Suggested improvement
-- The text dataset format follows the OpenAI Messages format as seen [here](https://docs.axolotl.ai/docs/dataset-formats/conversation.html#chat_template).
+- The text dataset format follows the OpenAI Messages format as documented in the [chat_template dataset format guide](https://docs.axolotl.ai/docs/dataset-formats/conversation.html#chat_template).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/mistral4/README.md` at line 69, Replace the non-descriptive link text "here" in the README sentence about the text dataset format with a clear, descriptive label such as "Axolotl OpenAI Messages format documentation" or "OpenAI Messages format (Axolotl)" so the link reads like: "The text dataset format follows the OpenAI Messages format as seen in the Axolotl documentation." Update the anchor text accordingly in examples/mistral4/README.md to improve accessibility and clarity.
src/axolotl/integrations/kernels/sonicmoe/routing.py (1)
137-160: Add explicit group-routing invariants before view/topk (and clean up unused unpack).
This path currently assumes valid MoE metadata; invalid values will fail deep in tensor ops with hard-to-debug errors. A fast-fail check makes failures deterministic and clearer.
Proposed hardening patch
 def softmax_group_topk_routing(
     hidden_states: torch.Tensor, moe_block
 ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
     """Mistral4-style routing: softmax -> group selection -> topk -> renorm -> scale."""
     gate = moe_block.gate
-    T, H = hidden_states.shape
+    T, _ = hidden_states.shape
     K = moe_block.top_k
     E = getattr(moe_block, "n_routed_experts", gate.weight.shape[0])
     n_group = getattr(moe_block, "n_group", 1)
+    if E % n_group != 0:
+        raise ValueError(f"Invalid routing layout: n_routed_experts={E} not divisible by n_group={n_group}")
+    group_size = E // n_group
+    topk_group = getattr(moe_block, "topk_group", n_group)
+    if topk_group > n_group:
+        raise ValueError(f"Invalid topk_group={topk_group}; must be <= n_group={n_group}")
+    if n_group > 1 and group_size < 2:
+        raise ValueError(f"Invalid group_size={group_size}; group routing requires at least 2 experts per group")

     router_logits = F.linear(hidden_states, gate.weight)  # [T, E]
     router_probs = F.softmax(router_logits, dim=-1, dtype=torch.float32)  # [T, E]
     scores_for_choice = router_probs

     # Group selection: pick top groups, mask the rest
     if n_group > 1:
         group_scores = (
-            scores_for_choice.view(-1, n_group, E // n_group)
+            scores_for_choice.view(-1, n_group, group_size)
             .topk(2, dim=-1)[0]
             .sum(dim=-1)
         )
         group_idx = torch.topk(
-            group_scores, k=moe_block.topk_group, dim=-1, sorted=False
+            group_scores, k=topk_group, dim=-1, sorted=False
         )[1]
         group_mask = torch.zeros_like(group_scores)
         group_mask.scatter_(1, group_idx, 1)
         score_mask = (
-            group_mask.unsqueeze(-1).expand(-1, n_group, E // n_group).reshape(-1, E)
+            group_mask.unsqueeze(-1).expand(-1, n_group, group_size).reshape(-1, E)
         )
         scores_for_choice = scores_for_choice.masked_fill(~score_mask.bool(), 0.0)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/axolotl/integrations/kernels/sonicmoe/routing.py` around lines 137 - 160, Add explicit fast-fail checks before the group-routing reshape and topk to validate MoE metadata and avoid cryptic tensor errors: assert or raise a clear ValueError if n_group <= 0, if E % n_group != 0, or if getattr(moe_block, "topk_group", 1) > n_group; also verify hidden_states.dim() == 2 and that T matches hidden_states.shape[0] if you keep T. Replace the unused unpack "T, H = hidden_states.shape" with "T, _ = hidden_states.shape" (or use hidden_states.size(0) for T), and then perform the checks prior to scores_for_choice.view(...) and torch.topk so failures show meaningful messages referencing n_group, E, and moe_block.topk_group.
examples/mistral4/fft-vision.yml (1)
20-20: Use a model/example-specific output_dir to avoid accidental run collisions.
Using the same ./outputs/out across the new mistral4 example files can overwrite checkpoints/logs when users try multiple configs back-to-back.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/mistral4/fft-vision.yml` at line 20, The config uses a generic output_dir value ("./outputs/out") which can cause run collisions; update the output_dir key in fft-vision.yml to a model/example-specific path (e.g. include the model and example name or an experiment variable such as "./outputs/mistral4_fft-vision" or "./outputs/${EXPERIMENT_NAME}") so each example run writes to a unique directory.
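One way to get a distinct directory per example is to derive it from the config file's own path. The helper below is only a sketch of that idea: `example_output_dir` is a hypothetical name and is not part of Axolotl, and the `./outputs` root is taken from the value quoted in the comment above.

```python
from pathlib import Path


def example_output_dir(config_path: str, root: str = "./outputs") -> str:
    """Build an output directory name from the example folder and config file name.

    Hypothetical helper for illustration; Axolotl itself just reads `output_dir`
    from the YAML file.
    """
    path = Path(config_path)
    # e.g. "examples/mistral4/fft-vision.yml" -> folder "mistral4", stem "fft-vision"
    return f"{root}/{path.parent.name}-{path.stem}"


if __name__ == "__main__":
    for cfg in ("examples/mistral4/fft-vision.yml", "examples/mistral4/qlora-text.yml"):
        print(cfg, "->", example_output_dir(cfg))
```

Each of the four mistral4 example configs would map to a different directory under this scheme, so back-to-back runs would not share checkpoints or logs.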
📒 Files selected for processing (18)
examples/colab-notebooks/colab-axolotl-example.ipynb
examples/mistral4/README.md
examples/mistral4/fft-text.yml
examples/mistral4/fft-vision.yml
examples/mistral4/qlora-text.yml
examples/mistral4/qlora-vision.yml
requirements.txt
scripts/cutcrossentropy_install.py
src/axolotl/common/architectures.py
src/axolotl/integrations/cut_cross_entropy/README.md
src/axolotl/integrations/cut_cross_entropy/__init__.py
src/axolotl/integrations/kernels/constants.py
src/axolotl/integrations/kernels/plugin.py
src/axolotl/integrations/kernels/sonicmoe/routing.py
src/axolotl/loaders/model.py
src/axolotl/loaders/processor.py
src/axolotl/monkeypatch/multipack.py
src/axolotl/utils/config/__init__.py
base_model: mistralai/Leanstral-2603-HF

plugins:
  - axolotl.integrations.kernels.KernelsPlugin
use_kernels: true
use_sonicmoe: true

# only train language model layers, freeze vision tower
unfrozen_parameters:
  - model.language_model.*
  - lm_head
  - embed_tokens

datasets:
  - path: fozziethebeat/alpaca_messages_2k_test
    type: chat_template

dataset_prepared_path: last_run_prepared
val_set_size: 0.01
output_dir: ./outputs/out

sequence_len: 2048
sample_packing: true
eval_sample_packing: true
pad_to_sequence_len: true

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 2e-5

bf16: true
tf32: true

gradient_checkpointing: true
logging_steps: 1
flash_attention: true

warmup_ratio: 0.1
evals_per_epoch: 1
saves_per_epoch: 1
weight_decay: 0.0

fsdp_version: 2
fsdp_config:
  offload_params: false
  cpu_ram_efficient_loading: false
  state_dict_type: FULL_STATE_DICT
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Mistral4DecoderLayer
  reshard_after_forward: true
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify Mistral4DecoderLayer exists for FSDP wrapping
python -c "
try:
    from transformers.models.mistral4.modeling_mistral4 import Mistral4DecoderLayer
    print('SUCCESS: Mistral4DecoderLayer found')
except ImportError as e:
    print(f'ERROR: {e}')
"
Repository: axolotl-ai-cloud/axolotl
Length of output: 105
🏁 Script executed:
find . -type f -name "*.yml" | head -20 | xargs ls -la
Repository: axolotl-ai-cloud/axolotl
Length of output: 1696
🏁 Script executed:
fd "mistral" --type f | head -30
Repository: axolotl-ai-cloud/axolotl
Length of output: 886
🏁 Script executed:
rg "Mistral4" --type yml --type py --type txt -i
Repository: axolotl-ai-cloud/axolotl
Length of output: 95
🏁 Script executed:
rg "mistralai/Leanstral" --type yml --type py --type txt -i
Repository: axolotl-ai-cloud/axolotl
Length of output: 95
🏁 Script executed:
fd "\.yml$" examples/ | grep -i mistral
Repository: axolotl-ai-cloud/axolotl
Length of output: 675
🏁 Script executed:
rg "Mistral4" -t py
Repository: axolotl-ai-cloud/axolotl
Length of output: 353
🏁 Script executed:
cat examples/mistral4/qlora-text.yml
Repository: axolotl-ai-cloud/axolotl
Length of output: 1037
🏁 Script executed:
rg "KernelsPlugin" -t py
Repository: axolotl-ai-cloud/axolotl
Length of output: 288
🏁 Script executed:
rg "Leanstral" -r .
Repository: axolotl-ai-cloud/axolotl
Length of output: 716
🏁 Script executed:
rg "Mistral4DecoderLayer" -t py
Repository: axolotl-ai-cloud/axolotl
Length of output: 50
🏁 Script executed:
rg "DecoderLayer" -t py | grep -i mistral4
Repository: axolotl-ai-cloud/axolotl
Length of output: 50
🏁 Script executed:
rg "transformer_layer_cls_to_wrap" examples/
Repository: axolotl-ai-cloud/axolotl
Length of output: 4672
🏁 Script executed:
cat examples/mistral4/README.md
Repository: axolotl-ai-cloud/axolotl
Length of output: 2706
Mistral4DecoderLayer class does not exist in the codebase and should be verified before using this configuration.
The FSDP configuration references transformer_layer_cls_to_wrap: Mistral4DecoderLayer, but this class is not found in the codebase. While the pattern matches other supported models (e.g., MistralDecoderLayer, Qwen2DecoderLayer), the corresponding class for Mistral4 is missing. The configuration will fail at runtime when FSDP attempts to wrap layers.
Note: The README explicitly states these configs are experimental and require hyperparameter adjustments. Verify the correct transformer layer class name from the Leanstral model implementation before running this configuration.
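Whether the class exists depends on the installed transformers build (the PR description says Mistral Small 4 support comes from huggingface/transformers#44760 or main), so a local check is more reliable than a repository search. A minimal sketch of such a check follows; `has_class` is a hypothetical helper, and the module path in the usage comment is the one the analysis script above tried to import.

```python
import importlib


def has_class(module_path: str, class_name: str) -> bool:
    """Return True if `module_path` imports cleanly and exposes `class_name`."""
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        # Covers ModuleNotFoundError too (it subclasses ImportError),
        # e.g. when the installed library predates the model.
        return False
    return hasattr(module, class_name)


if __name__ == "__main__":
    # Result depends entirely on which transformers version is installed.
    found = has_class(
        "transformers.models.mistral4.modeling_mistral4", "Mistral4DecoderLayer"
    )
    print("Mistral4DecoderLayer available:", found)
```

Running this in the training environment before launching FSDP would show whether `transformer_layer_cls_to_wrap: Mistral4DecoderLayer` can be resolved there.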
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/mistral4/fft-text.yml` around lines 1 - 59, The FSDP setting
references transformer_layer_cls_to_wrap: Mistral4DecoderLayer which does not
exist; update this to the actual transformer layer class used by the Leanstral
model (verify in the model implementation for the correct class name, e.g.,
MistralDecoderLayer or the Leanstral-specific layer) or remove the
transformer_layer_cls_to_wrap entry so FSDP auto-detection is used; ensure
transformer_layer_cls_to_wrap is set to the exact class symbol exported by the
model code that matches base_model: mistralai/Leanstral-2603-HF.
"qwen3_vl_moe": "Qwen3VLMoeTextSparseMoeBlock",
"deepseek_v2": "DeepseekV2MoE",
"deepseek_v3": "DeepseekV3MoE",
"mistral4": "Mistral4MoE",
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for Mistral4MoE class definition in transformers
python -c "
try:
    from transformers.models.mistral4.modeling_mistral4 import Mistral4MoE
    print('SUCCESS: Mistral4MoE class found')
except ImportError as e:
    print(f'ERROR: {e}')
except Exception as e:
    print(f'ERROR: {e}')
"
Repository: axolotl-ai-cloud/axolotl
Length of output: 105
🏁 Script executed:
#!/bin/bash
# First, let's look at the full context of the MOE_ARCH_BLOCK mapping
cat -n src/axolotl/common/architectures.py | head -50
Repository: axolotl-ai-cloud/axolotl
Length of output: 1004
🏁 Script executed:
#!/bin/bash
# Search for references to MOE_ARCH_BLOCK usage
rg "MOE_ARCH_BLOCK" --type py -B 2 -A 2
Repository: axolotl-ai-cloud/axolotl
Length of output: 1115
🏁 Script executed:
#!/bin/bash
# Search for any references to Mistral4MoE in the codebase
rg "Mistral4MoE" --type py
Repository: axolotl-ai-cloud/axolotl
Length of output: 211
🏁 Script executed:
#!/bin/bash
# Search for mistral4 model references
rg "mistral4" --type py -i
Repository: axolotl-ai-cloud/axolotl
Length of output: 753
🏁 Script executed:
#!/bin/bash
# Check what other models are in similar architecture mappings
rg "MoE\|Moe" src/axolotl/common/architectures.py
Repository: axolotl-ai-cloud/axolotl
Length of output: 50
🏁 Script executed:
#!/bin/bash
# Look for how transformers models are imported/loaded
rg "from transformers" src/axolotl/loaders/model.py | head -20
Repository: axolotl-ai-cloud/axolotl
Length of output: 327
🏁 Script executed:
#!/bin/bash
# Check if there's any documentation or requirements about mistral4
fd "requirements" --type f | xargs cat 2>/dev/null | grep -i mistral
Repository: axolotl-ai-cloud/axolotl
Length of output: 90
🏁 Script executed:
#!/bin/bash
# Look for transformers version or mistral-related dependencies
cat setup.py 2>/dev/null || cat pyproject.toml 2>/dev/null | head -100
Repository: axolotl-ai-cloud/axolotl
Length of output: 8078
🌐 Web query:
mistral4 model transformers library
💡 Result:
There isn’t a dedicated “Mistral4” architecture/class in the Hugging Face transformers library; official support is for Mistral (and related families like Mistral3/Mixtral) and you typically load checkpoints via the Auto* APIs. [1]
If by “mistral4” you mean a community Hugging Face repo named .../mistral4 (e.g., kouki13/mistral4), it’s just a model name on the Hub and will generally still load as a Mistral-compatible causal LM via AutoModelForCausalLM. [4]
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "kouki13/mistral4" # or your HF repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
transformers added first-party Mistral support starting with v4.34.0 (and it has been maintained since). [2]
If you meant Mistral AI’s hosted “latest” models (API models like “Mistral Large/Medium/Small/…”, etc.), those are listed in Mistral’s model overview docs, but they’re not a single “mistral4” transformers model class. [3]
Sources: [1] [2] [3] [4]
Mistral4MoE class does not exist in the transformers library.
The transformers library does not include a mistral4 architecture or Mistral4MoE class (only the standard mistral architecture is supported). This mapping will fail at runtime when get_module_class_from_name attempts to resolve the class from the loaded model's hierarchy for DeepSpeed Zero3 leaf module configuration. Either the model type or class name needs to be corrected to match an existing transformers architecture.
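The failure mode described here depends on how a class name is resolved against the loaded model. The sketch below shows that kind of lookup on a toy module tree; `Node`, `DecoderLayer`, `Model`, and `find_class_by_name` are hypothetical stand-ins written for illustration, not the real helper or real torch modules. The point is only that a name absent from the hierarchy produces no class.

```python
from typing import Iterator, Optional


class Node:
    """Toy stand-in for a module that exposes its direct children."""

    def __init__(self, *children: "Node") -> None:
        self._children = list(children)

    def children(self) -> Iterator["Node"]:
        return iter(self._children)


class DecoderLayer(Node):
    pass


class Model(Node):
    pass


def find_class_by_name(module: Node, name: str) -> Optional[type]:
    """Depth-first search for the first submodule whose class is called `name`."""
    if type(module).__name__ == name:
        return type(module)
    for child in module.children():
        found = find_class_by_name(child, name)
        if found is not None:
            return found
    return None  # name is not present anywhere in the hierarchy


if __name__ == "__main__":
    model = Model(Node(DecoderLayer(), DecoderLayer()))
    print(find_class_by_name(model, "DecoderLayer"))
    print(find_class_by_name(model, "Mistral4MoE"))
```

Under this reading, the mapping value only has to match a class name that actually appears in the loaded model, which is why the entry is correct or incorrect depending on the model implementation that gets loaded.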
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/axolotl/common/architectures.py` at line 19, The mapping entry with key
"mistral4" and value "Mistral4MoE" is invalid because transformers has no
Mistral4MoE class; update the mapping in the architectures dictionary by either
renaming the key to the existing model type ("mistral") or replacing the value
with an actual transformers class name (e.g., "MistralForCausalLM"), so
get_module_class_from_name can resolve the class; locate the entry referencing
"mistral4" and "Mistral4MoE" and change it to a supported pair (for example
"mistral": "MistralForCausalLM") or remove the unsupported mapping.
# softmax -> topk routing (with group-based expert selection)
"mistral4": "Mistral4MoE",
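The routing style named in the comment above (softmax, group-based expert selection, then top-k) can be illustrated for a single token in plain Python. This is a minimal sketch and not the PR's implementation: `check_group_layout` and `group_topk_route` are hypothetical names, the checks mirror the fast-fail validation suggested for routing.py, the renormalization over the selected experts is an assumption based on the "renorm" step in the docstring, and the final scaling step is omitted.

```python
import math
from typing import List, Tuple


def check_group_layout(n_experts: int, n_group: int, topk_group: int) -> int:
    """Fail fast on an invalid layout; return the number of experts per group."""
    if n_group <= 0:
        raise ValueError(f"n_group={n_group} must be positive")
    if n_experts % n_group != 0:
        raise ValueError(f"n_experts={n_experts} not divisible by n_group={n_group}")
    if topk_group > n_group:
        raise ValueError(f"topk_group={topk_group} must be <= n_group={n_group}")
    group_size = n_experts // n_group
    if n_group > 1 and group_size < 2:
        raise ValueError("group scoring sums the top 2 experts per group")
    return group_size


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def group_topk_route(
    logits: List[float], n_group: int, topk_group: int, top_k: int
) -> Tuple[List[int], List[float]]:
    """Route one token: returns (expert indices, renormalized weights)."""
    n_experts = len(logits)
    group_size = check_group_layout(n_experts, n_group, topk_group)
    probs = softmax(logits)
    scores = list(probs)
    if n_group > 1:
        # Score each group by the sum of its two largest expert probabilities.
        group_scores = [
            sum(sorted(probs[g * group_size:(g + 1) * group_size], reverse=True)[:2])
            for g in range(n_group)
        ]
        ranked = sorted(range(n_group), key=lambda g: group_scores[g], reverse=True)
        kept = set(ranked[:topk_group])
        # Zero out experts that belong to groups which were not selected.
        scores = [p if i // group_size in kept else 0.0 for i, p in enumerate(probs)]
    chosen = sorted(range(n_experts), key=lambda i: scores[i], reverse=True)[:top_k]
    # Assumed renorm step: selected weights are rescaled to sum to 1.
    total = sum(probs[i] for i in chosen)
    return chosen, [probs[i] / total for i in chosen]


if __name__ == "__main__":
    experts, weights = group_topk_route(
        [2.0, 1.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0], n_group=2, topk_group=1, top_k=2
    )
    print(experts, weights)
```

With an invalid layout, for example six experts split into four groups, the sketch raises a `ValueError` before any scoring happens, which is the behavior the hardening suggestion is after.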
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the module path and class exist
python -c "
import importlib
try:
    module = importlib.import_module('transformers.models.mistral4.modeling_mistral4')
    cls = getattr(module, 'Mistral4MoE', None)
    if cls:
        print('SUCCESS: Mistral4MoE found in transformers.models.mistral4.modeling_mistral4')
    else:
        print('ERROR: Mistral4MoE class not found in module')
except ModuleNotFoundError as e:
    print(f'ERROR: Module not found - {e}')
"
Repository: axolotl-ai-cloud/axolotl
Length of output: 124
🏁 Script executed:
# First, let's read the constants.py file to understand the full context
cat -n src/axolotl/integrations/kernels/constants.py
Repository: axolotl-ai-cloud/axolotl
Length of output: 3407
🏁 Script executed:
# Search for any existing mistral4 references in the codebase
rg "mistral4" --no-heading
Repository: axolotl-ai-cloud/axolotl
Length of output: 1036
🏁 Script executed:
# Check what other models are in the SPARSE_MOE_BLOCK dictionary to understand the pattern
rg "SPARSE_MOE_BLOCK" -A 20
Repository: axolotl-ai-cloud/axolotl
Length of output: 3467
🌐 Web query:
transformers library mistral4 modeling class Mistral4MoE
💡 Result:
transformers does not currently include any modeling class named Mistral4MoE (and there’s no official “Mistral4” modeling module in the library).
If you meant Mistral’s MoE model “Mixtral”, the corresponding Hugging Face Transformers modeling classes live under transformers.models.mixtral, notably:
- MixtralModel
- MixtralForCausalLM (plus seq/token classification + QA variants) [1]

Example import:
```python
from transformers import MixtralForCausalLM
# or: from transformers.models.mixtral.modeling_mixtral import MixtralForCausalLM
```
[1][2]
**Sources:** [1][2]
---
</details>
**Remove mistral4 entry or verify transformers support exists.**
The mapping `"mistral4": "Mistral4MoE"` cannot be resolved. The transformers library does not include a `mistral4` module or `Mistral4MoE` class. The `resolve_moe_block_classes` function will fail with `ModuleNotFoundError` when attempting to import `transformers.models.mistral4.modeling_mistral4`. Confirm whether this is a custom addition requiring external implementation or if a different model type should be used instead.
<details>
<summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In @src/axolotl/integrations/kernels/constants.py around lines 28 - 29, The
mapping entry "mistral4": "Mistral4MoE" in the model->class map is unresolved
and will cause resolve_moe_block_classes to attempt importing
transformers.models.mistral4.modeling_mistral4 and raise ModuleNotFoundError;
either remove the "mistral4" entry from the mapping or replace it with a
supported model/class already present in transformers, or if this is a custom
implementation ensure the corresponding module and class are added to the
codebase and importable as modeling_mistral4 with class Mistral4MoE so
resolve_moe_block_classes can import it successfully.
</details>
Description
https://mistral.ai/news/mistral-small-4
FFT config is untested given lack of time. QLoRA is tested.
Requires their transformers branch from huggingface/transformers#44760 (or transformers main now).
Blocker:
Motivation and Context
How has this been tested?
AI Usage Disclaimer
Claude helped run experiments.
Screenshots (if appropriate)
Types of changes
Social Handles (Optional)