Feat: add Magistral Small 2509 and native mistral3 tokenizer support by NanoCode012 · Pull Request #3165 · axolotl-ai-cloud/axolotl

NanoCode012 · 2025-09-17T13:21:41Z

Description

We previously pointed users to use HF tokenizer for the mistral-small-3.1 series. This PR updates it to use their native mistral tokenizer instead. This should be more accurate.

This also adds support for the new Magistral Small 2509 model!

Additional changes:

added pixtral FA support
fixed mistral tokenizer image cast to tensor
improved doc on using demo dataset
bumped mistral-common to 1.8.5
removed unneeded mistral tokenizer patch
moved around old mistral configs.

Follow up PR:

Patch Mistral Tokenizer to support image key.

Motivation and Context

How has this been tested?

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

New Features
- Added support and example config for Magistral Small-2509 (vision).
- Introduced Mistral3 processor and processing strategy; optional mistral-common tokenizer flag.
- Enabled Flash Attention in example configs for faster training.
Bug Fixes
- Improved image tensor handling in tokenizer.
- Fixed packed-sequence detection for Flash Attention with 1D position ids.
Documentation
- New vision and thinking fine-tuning guides; updated Magistral overview.
- Updated Mistral-Small-3.1 and Voxtral/Gemma examples with dataset download steps and setup tips.
Tests
- Added integration tests for tokenizer and Flash Attention patches.
Chores
- Bumped mistral-common to 1.8.5.

coderabbitai · 2025-09-17T13:21:47Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

Adds Mistral/Magistral multimodal support and docs: introduces a Mistral3 processor, processing strategy, and runtime patches (tokenizer image handling and Flash Attention utils). Updates examples, configs, and docs for Magistral Small (including 2509 Vision), adjusts tokenizer/loader flows, bumps mistral-common to 1.8.5, and adds integration tests for patches.

Changes

Cohort / File(s)	Summary
Docs: Multimodal page `docs/multimodal.qmd`	Adds Magistral-Small-2509 entry and section; inserts vision library install tip; replaces chat_template lines with base_model YAML; minor Mistral-Small-3.1 callout.
Examples: README updates `examples/gemma3n/README.md`, `examples/magistral/README.md`, `examples/voxtral/README.md`	Adds pre-finetuning dataset download steps; refocuses Magistral docs on Vision (mentions 2506/2507/2509), links to thinking/vision guides, updates getting started steps and tips.
Examples: New guides `examples/magistral/think/README.md`, `examples/magistral/vision/README.md`	Adds dedicated fine-tuning guides for Magistral Small 2507 (thinking) and 2509 (vision), including prerequisites, commands, dataset formats, and constraints.
Examples: Vision config (new) `examples/magistral/vision/magistral-small-vision-24B-qlora.yml`	Adds 24B QLoRA config for Magistral-Small-2509 VLM with CutCrossEntropy, 4-bit, LORA targets, dataset and training settings, and tokenizer_use_mistral_common via processor.
Examples: Config updates `examples/mistral/mistral-small/mistral-small-3.1-24B-lora.yml`, `examples/pixtral/lora-12b.yml`	Switches to `flash_attention: true`; removes `sdp_attention`; updates dataset to `Nanobit/text-vision-2k-test`; adds `tokenizer_use_mistral_common: true`; removes top-level chat_template.
Dependency bump `requirements.txt`	Updates `mistral-common` from 1.8.3 to 1.8.5.
Loaders: patching and processor flow `src/axolotl/loaders/patch_manager.py`, `src/axolotl/loaders/processor.py`, `src/axolotl/loaders/tokenizer.py`	Adds conditional runtime patches for Mistral3 tokenizer and Flash Attention; introduces early-return path for Mistral3Processor when `tokenizer_use_mistral_common` is true; simplifies MistralCommon tokenizer load (removes internal transformer patch).
Monkeypatches (new) `src/axolotl/monkeypatch/models/mistral3/mistral_common_tokenizer.py`, `src/axolotl/monkeypatch/models/pixtral/modeling_flash_attention_utils.py`	Adds tokenizer patch to optimize image tensor creation in `apply_chat_template`; adds patch to handle 1D `position_ids` in HF Flash Attention utils with unpatch support.
Processing strategy and utilities `src/axolotl/processing_strategies.py`, `src/axolotl/utils/mistral/__init__.py`, `src/axolotl/utils/mistral/mistral3_processor.py`	Introduces `Mistral3ProcessingStrategy` for image-token masking; exports `Mistral3Processor`; adds new processor implementing image-aware `apply_chat_template` and HF-compatible API.
Tests: integration for patches `tests/e2e/patched/test_mistral_tokenizer_patch.py`, `tests/e2e/patched/test_pixtral_flash_attention_patch.py`, `tests/e2e/patched/test_voxtral_modeling_patch.py`	Adds tests to validate tokenizer image patch, Flash Attention patch (with unpatch), and Voxtral forward patch application/restoration.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

update transformers to 4.53.1 #2844 — Adjusts patching infrastructure in patch_manager; closely related to new conditional patches added here.
Feat: add devstral model support #2880 — Changes around Mistral tokenizer/processing; overlaps with tokenizer patching and apply_chat_template behavior.
Feat: Add voxtral, magistral small 1.1, and misc gemma3n fixes #2979 — Integrates Mistral tokenizer/processor and runtime patches; directly related to adding Mistral3Processor and associated flows.

Suggested labels

ready to merge, scheduled_release

Suggested reviewers

winglian
djsaunde

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 72.73% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title "Feat: add Magistral Small 2509 and native mistral3 tokenizer support" is concise and accurately highlights the primary changes in the PR—adding Magistral-Small-2509 and introducing native Mistral3 tokenizer/processor support—so it is directly related to the changes in the diff (docs, processor/tokenizer, and related patches). It is specific enough for a quick scan and avoids noisy details while remaining short and clear.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2025-09-17T17:20:17Z

📖 Documentation Preview: https://68cb9187a95507bda40326d5--resonant-treacle-0fd729.netlify.app

Deployed on Netlify from commit e7f5d44

codecov · 2025-09-17T17:21:59Z

Codecov Report

❌ Patch coverage is 51.29870% with 75 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/axolotl/utils/mistral/mistral3_processor.py	27.02%	54 Missing ⚠️
src/axolotl/processing_strategies.py	29.41%	12 Missing ⚠️
src/axolotl/loaders/patch_manager.py	33.33%	4 Missing ⚠️
...ypatch/models/mistral3/mistral_common_tokenizer.py	91.89%	3 Missing ⚠️
src/axolotl/loaders/processor.py	33.33%	2 Missing ⚠️

📢 Thoughts on this report? Let us know!

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (23)

examples/pixtral/lora-12b.yml (1)

48-48: Flash Attention default: ensure availability/fallback

Setting flash_attention: true will hard-require FA2. If FA isn’t installed or the patch fails, training will crash. Consider documenting a fallback (e.g., keep sdp_attention: true in commented form) and reference the new Pixtral FA patch prerequisite.
examples/gemma3n/README.md (1)
26-33: Clarify download location for sample assets

The wget commands drop files into CWD; some users will run from repo root and later configs won’t find them. Add a target dir (e.g., data/) or a note to place them where the config expects.
- wget https://.../African_elephant.jpg
- wget https://.../En-us-African_elephant.oga
+ mkdir -p data
+ wget -P data https://.../African_elephant.jpg
+ wget -P data https://.../En-us-African_elephant.oga
examples/voxtral/README.md (1)
30-38: Pin mistral_common[audio] to 1.8.5 to match requirements

Docs install mistral_common[audio]==1.8.3 while the project now depends on 1.8.5. Update to avoid version skew during finetuning.

Apply:
- pip3 install 'mistral_common[audio]==1.8.3'
+ pip3 install 'mistral_common[audio]==1.8.5'
examples/magistral/README.md (1)
65-67: Call out how to enable mistral-common path in configs

Limitations mention mistral-common only; add a short pointer to set tokenizer_use_mistral_common: true (as in the example config) to reduce user confusion migrating from HF tokenizer.
-We only support the `mistral-common` tokenizer for Supervised Fine-tuning at the moment and for `type: chat_template` only.
+We only support the `mistral-common` tokenizer for Supervised Fine-tuning (set `tokenizer_use_mistral_common: true` in your config) and for `type: chat_template` only.
src/axolotl/processing_strategies.py (2)
425-453: Guard deep attribute access for special_ids

processor.tokenizer.tokenizer.instruct_tokenizer.image_encoder.special_ids is brittle; add a clear error if structure changes to avoid AttributeError at runtime.
-        special_ids = (
-            processor.tokenizer.tokenizer.instruct_tokenizer.image_encoder.special_ids
-        )
+        try:
+            special_ids = (
+                processor.tokenizer.tokenizer.instruct_tokenizer.image_encoder.special_ids
+            )
+        except AttributeError as e:
+            raise RuntimeError(
+                "Mistral3ProcessingStrategy: unable to resolve image special_ids "
+                "(expected tokenizer.tokenizer.instruct_tokenizer.image_encoder.special_ids). "
+                "Ensure mistral-common==1.8.5 and the Mistral tokenizer is selected."
+            ) from e
469-473: Avoid passing dummy chat_template for Mistral3

HFMistralTokenizer.chat_template returns a dummy string; skip auto-wiring chat_template when using Mistral3 to prevent accidental use downstream.
-    if chat_template_type in [None, "tokenizer_default"] and hasattr(
-        processor.tokenizer, "chat_template"
-    ):
-        processing_kwargs["chat_template"] = processor.tokenizer.chat_template
+    if (
+        chat_template_type in [None, "tokenizer_default"]
+        and hasattr(processor.tokenizer, "chat_template")
+        and not isinstance(processor, Mistral3Processor)
+    ):
+        processing_kwargs["chat_template"] = processor.tokenizer.chat_template
Also applies to: 497-501
src/axolotl/loaders/patch_manager.py (1)
171-177: Gate mistral tokenizer patch on the correct flag, not processor_type.

Relying on cfg.processor_type can silently skip the patch when users don’t set it. The patch is needed only when using the native mistral-common tokenizer path; gate on cfg.tokenizer_use_mistral_common instead.

Apply:
-        if self.cfg.model_config_type == "mistral3" and self.cfg.processor_type:
+        if self.cfg.model_config_type == "mistral3" and getattr(self.cfg, "tokenizer_use_mistral_common", False):
             from axolotl.monkeypatch.models.mistral3.mistral_common_tokenizer import (
                 apply_mistral_tokenizer_image_patch,
             )

             apply_mistral_tokenizer_image_patch()
docs/multimodal.qmd (2)
98-105: Call out the config flag required for native Mistral tokenizer.

Users need tokenizer_use_mistral_common: true to activate this path. Add a one-liner to prevent confusion.

Apply:
 ::: {.callout-tip}
 Please make sure to install vision lib via `pip install 'mistral-common[opencv]==1.8.5'`
 :::
 
 ```yaml
 base_model: mistralai/Mistral-Small-3.1-24B-Instruct-2503
+Note: enable the native tokenizer path with tokenizer_use_mistral_common: true in your training config.
---

`106-115`: **Mirror the same native-tokenizer note for Magistral.**

Keep guidance consistent across models that use the native mistral tokenizer.

Apply:

```diff
 ::: {.callout-tip}
 Please make sure to install vision lib via `pip install 'mistral-common[opencv]==1.8.5'`
 :::
 
 ```yaml
 base_model: mistralai/Magistral-Small-2509
+Note: set tokenizer_use_mistral_common: true in your config to use the native tokenizer with image support.
</blockquote></details>
<details>
<summary>src/axolotl/loaders/processor.py (1)</summary><blockquote>

`24-29`: **Improve error message when the native processor is unavailable.**

Wrap the import to give users a clear action (install mistral-common with opencv).

Apply:

```diff
-    if cfg.tokenizer_use_mistral_common:
-        from axolotl.utils.mistral import Mistral3Processor
+    if cfg.tokenizer_use_mistral_common:
+        try:
+            from axolotl.utils.mistral import Mistral3Processor
+        except ImportError as e:
+            raise ImportError(
+                "Mistral3Processor not found. Please install vision deps: "
+                "pip install 'mistral-common[opencv]==1.8.5'"
+            ) from e
 
         return Mistral3Processor(
             tokenizer=tokenizer,
         )
examples/magistral/vision/README.md (2)
12-16: Mention the config flag needed to activate the native tokenizer.

Add a short note so users don’t miss tokenizer_use_mistral_common: true.

Apply:
 1. Install the required vision lib:
     ```bash
     pip install 'mistral-common[opencv]==1.8.5'
     ```
+
+Also ensure your YAML sets `tokenizer_use_mistral_common: true`.
42-43: Clarify unsupported image inputs.

Explicitly list the supported keys to avoid ambiguity.

Apply:
-One exception is that, passing `"image": PIL.Image` is not supported. MistralTokenizer only supports `path`, `url`, and `base64` for now.
+Note: passing `"image": PIL.Image` is not supported. The Mistral tokenizer currently supports only `"path"`, `"url"`, and `"base64"` for image inputs.
tests/e2e/patched/test_mistral_tokenizer_patch.py (1)
24-35: Restore original method after the test to avoid cross-test side effects.

The patch persists process-wide; restore it in a finally block for isolation.

Apply:
         # Apply patch
-        apply_mistral_tokenizer_image_patch()
+        apply_mistral_tokenizer_image_patch()
 
-        # Verify patch was applied
-        assert (
-            MistralCommonTokenizer.apply_chat_template != original_apply_chat_template
-        ), "apply_chat_template was not patched"
-
-        # Verify the method is still callable
-        assert callable(MistralCommonTokenizer.apply_chat_template), (
-            "Patched method is not callable"
-        )
+        try:
+            # Verify patch was applied
+            assert (
+                MistralCommonTokenizer.apply_chat_template != original_apply_chat_template
+            ), "apply_chat_template was not patched"
+
+            # Verify the method is still callable
+            assert callable(MistralCommonTokenizer.apply_chat_template), (
+                "Patched method is not callable"
+            )
+        finally:
+            # Restore
+            MistralCommonTokenizer.apply_chat_template = original_apply_chat_template
examples/mistral/mistral-small/mistral-small-3.1-24B-lora.yml (1)

14-18: Minor: consider adding an inline note about pre-downloading images.

Good hint; optionally add a comment that absolute paths are recommended to avoid CWD issues.
src/axolotl/monkeypatch/models/pixtral/modeling_flash_attention_utils.py (1)
27-30: Return a Python bool, not a torch.bool tensor.

The current expression can yield a 0-dim tensor; be explicit to avoid truth-value ambiguities.

Apply:
-        return (
-            batch_size == 1
-            and (increasing_position_sequences - position_ids).abs().sum().bool()
-        )
+        return (batch_size == 1) and (increasing_position_sequences != position_ids).any().item()
src/axolotl/monkeypatch/models/mistral3/mistral_common_tokenizer.py (2)
23-31: Make the match-and-replace resilient to whitespace and upstream changes.

Exact string match on an indented line is brittle. Prefer a regex that preserves indentation, or match both torch.tensor(images) and torch.as_tensor(images) variants.

Apply:
+    import re
@@
-    original_tensor_conversion = (
-        "                    pixel_values = torch.tensor(images)"
-    )
+    pattern = re.compile(r"(?m)^(?P<indent>\s*)pixel_values\s*=\s*torch\.(?:as_)?tensor\(\s*images\s*\)")
@@
-    if original_tensor_conversion in original_source:
-        patched_source = original_source.replace(
-            original_tensor_conversion, patched_tensor_conversion
-        )
+    if pattern.search(original_source):
+        patched_source = pattern.sub(patched_tensor_conversion, original_source)
And keep patched_tensor_conversion using \g<indent>-aware formatting if you adopt the regex. I can provide that variant if desired.

Also applies to: 33-41

81-85: Return an unpatch callable for symmetry with other patches.

Other patches (e.g., Pixtral FA) expose unpatch(). Returning one here helps tests and controlled rollback.

Apply:
-        MistralCommonTokenizer.apply_chat_template = ns["patched_apply_chat_template"]
-        LOG.info("Successfully applied MistralCommonTokenizer tensor conversion patch")
+        old = getattr(MistralCommonTokenizer, "apply_chat_template")
+        MistralCommonTokenizer.apply_chat_template = ns["patched_apply_chat_template"]
+        LOG.info("Successfully applied MistralCommonTokenizer tensor conversion patch")
+        def _unpatch():
+            MistralCommonTokenizer.apply_chat_template = old
+        return _unpatch
Also document the return type in the docstring.
examples/magistral/vision/magistral-small-vision-24B-qlora.yml (2)
50-52: Avoid null fp16: make intent explicit.

Bare fp16: parses as null in YAML and may break boolean handling. Set it explicitly or remove.

Apply:
-bf16: true
-fp16:
+bf16: true
+fp16: false
17-25: Dataset prep flags can cause runtime surprises.

skip_prepare_dataset: true with dataset_prepared_path: last_run_prepared requires that directory to exist and match this config’s schema. Document this or default to false for out‑of‑the‑box runs.
src/axolotl/utils/mistral/mistral3_processor.py (4)
15-24: Annotate class defaults as ClassVar to satisfy Ruff and intent.

_defaults is a class-level constant; mark it as ClassVar[...].

Apply:
-from typing import Any, Dict, Optional, Union
+from typing import Any, Dict, Optional, Union, ClassVar
@@
-class Mistral3ProcessorKwargs(ProcessingKwargs):
-    _defaults: Dict[str, Dict[str, Any]] = {
+class Mistral3ProcessorKwargs(ProcessingKwargs):
+    _defaults: ClassVar[Dict[str, Dict[str, Any]]] = {
33-35: Annotate ProcessorMixin attributes as ClassVar.

Aligns with Transformers’ expectations and quiets RUF012.

Apply:
-    attributes = ["tokenizer"]
-    tokenizer_class = "HFMistralTokenizer"
+    attributes: ClassVar[list[str]] = ["tokenizer"]
+    tokenizer_class: ClassVar[str] = "HFMistralTokenizer"
120-141: Ensure image_sizes device matches pixel_values; reuse casted tensor.

Create image_sizes on the same device and read shape from the casted tensor to avoid surprises.

Apply:
-                if "pixel_values" in data:
-                    pixel_values = data["pixel_values"]
+                if "pixel_values" in data:
+                    pixel_values = data["pixel_values"]
@@
-                    data["pixel_values"] = pixel_values.to(dtype=torch.float32)
+                    pixel_values = pixel_values.to(dtype=torch.float32)
+                    data["pixel_values"] = pixel_values
@@
-                    batch_size = pixel_values.shape[0]
-                    image_sizes = torch.tensor([pixel_values.shape[-2:]] * batch_size)
+                    batch_size = pixel_values.shape[0]
+                    hw = pixel_values.shape[-2:]
+                    image_sizes = torch.tensor([hw] * batch_size, device=pixel_values.device)
                     data["image_sizes"] = image_sizes
98-108: Batched detection: tighten the check or add a fast path for list-of-dicts.

hasattr(..., "content") handles message objects, but a simpler and clearer test is to detect list-of-list vs list-of-dict explicitly.

Apply:
-        if isinstance(conversation, (list, tuple)) and (
-            isinstance(conversation[0], (list, tuple))
-            or hasattr(conversation[0], "content")
-        ):
+        if isinstance(conversation, (list, tuple)) and isinstance(conversation[0], (list, tuple)):
             is_batched = True
             conversations = conversation
         else:
If you do need to support message objects, add an explicit branch for that type.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e5c427f and e518eea.

📒 Files selected for processing (21)

docs/multimodal.qmd (2 hunks)
examples/gemma3n/README.md (1 hunks)
examples/magistral/README.md (3 hunks)
examples/magistral/think/README.md (1 hunks)
examples/magistral/vision/README.md (1 hunks)
examples/magistral/vision/magistral-small-vision-24B-qlora.yml (1 hunks)
examples/mistral/mistral-small/mistral-small-3.1-24B-lora.yml (2 hunks)
examples/pixtral/lora-12b.yml (1 hunks)
examples/voxtral/README.md (1 hunks)
requirements.txt (1 hunks)
src/axolotl/loaders/patch_manager.py (2 hunks)
src/axolotl/loaders/processor.py (1 hunks)
src/axolotl/loaders/tokenizer.py (0 hunks)
src/axolotl/monkeypatch/models/mistral3/mistral_common_tokenizer.py (1 hunks)
src/axolotl/monkeypatch/models/pixtral/modeling_flash_attention_utils.py (1 hunks)
src/axolotl/processing_strategies.py (3 hunks)
src/axolotl/utils/mistral/__init__.py (1 hunks)
src/axolotl/utils/mistral/mistral3_processor.py (1 hunks)
tests/e2e/patched/test_mistral_tokenizer_patch.py (1 hunks)
tests/e2e/patched/test_pixtral_flash_attention_patch.py (1 hunks)
tests/e2e/patched/test_voxtral_modeling_patch.py (1 hunks)

💤 Files with no reviewable changes (1)

src/axolotl/loaders/tokenizer.py

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-08-08T07:22:40.131Z

Learnt from: winglian
PR: axolotl-ai-cloud/axolotl#3038
File: examples/slurm/axolotl.slurm:16-16
Timestamp: 2025-08-08T07:22:40.131Z
Learning: In Axolotl (PR #3038), the preprocess codepath sets AXOLOTL_IS_PREPROCESS internally, so external scripts (e.g., examples/slurm/axolotl.slurm) need not export it for the early-return in src/axolotl/utils/data/sft.py to trigger.

Applied to files:

src/axolotl/loaders/processor.py

🧬 Code graph analysis (9)

tests/e2e/patched/test_mistral_tokenizer_patch.py (1)

src/axolotl/monkeypatch/models/mistral3/mistral_common_tokenizer.py (1)

apply_mistral_tokenizer_image_patch (14-85)

src/axolotl/loaders/patch_manager.py (2)

src/axolotl/monkeypatch/models/mistral3/mistral_common_tokenizer.py (1)

apply_mistral_tokenizer_image_patch (14-85)

src/axolotl/monkeypatch/models/pixtral/modeling_flash_attention_utils.py (1)

apply_patch_is_packed_sequence (6-42)

tests/e2e/patched/test_voxtral_modeling_patch.py (1)

src/axolotl/monkeypatch/models/voxtral/modeling.py (1)

patch_voxtral_conditional_generation_forward (10-67)

src/axolotl/loaders/processor.py (1)

src/axolotl/utils/mistral/mistral3_processor.py (1)

Mistral3Processor (27-169)

src/axolotl/utils/mistral/__init__.py (2)

src/axolotl/utils/mistral/mistral3_processor.py (1)

Mistral3Processor (27-169)

src/axolotl/utils/mistral/mistral_tokenizer.py (1)

HFMistralTokenizer (14-220)

src/axolotl/utils/mistral/mistral3_processor.py (1)

src/axolotl/utils/mistral/mistral_tokenizer.py (1)

HFMistralTokenizer (14-220)

src/axolotl/monkeypatch/models/mistral3/mistral_common_tokenizer.py (2)

src/axolotl/monkeypatch/utils.py (1)

detab_code (232-238)

src/axolotl/utils/logging.py (1)

get_logger (42-49)

tests/e2e/patched/test_pixtral_flash_attention_patch.py (1)

src/axolotl/monkeypatch/models/pixtral/modeling_flash_attention_utils.py (1)

apply_patch_is_packed_sequence (6-42)

src/axolotl/processing_strategies.py (2)

src/axolotl/utils/mistral/mistral3_processor.py (2)

Mistral3Processor (27-169)

chat_template (41-43)

src/axolotl/utils/mistral/mistral_tokenizer.py (1)

chat_template (41-43)

🪛 Ruff (0.12.2)

src/axolotl/utils/mistral/mistral3_processor.py

15-24: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

33-33: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

91-93: Avoid specifying long messages outside the exception class

(TRY003)

128-128: Avoid specifying long messages outside the exception class

(TRY003)

src/axolotl/monkeypatch/models/mistral3/mistral_common_tokenizer.py

55-55: Use of exec detected

(S102)

61-61: Use of exec detected

(S102)

62-62: Use of exec detected

(S102)

63-63: Use of exec detected

(S102)

64-64: Use of exec detected

(S102)

68-68: Use of exec detected

(S102)

69-69: Use of exec detected

(S102)

73-73: Use of exec detected

(S102)

74-74: Use of exec detected

(S102)

79-79: Use of exec detected

(S102)

82-82: Undefined name patched_apply_chat_template

(F821)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)

GitHub Check: preview
GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
GitHub Check: PyTest from Source Dist (3.11, 2.8.0)
GitHub Check: PyTest from Source Dist (3.11, 2.7.1)
GitHub Check: PyTest (3.11, 2.6.0)
GitHub Check: PyTest (3.11, 2.8.0)
GitHub Check: PyTest (3.11, 2.7.1)

🔇 Additional comments (8)

requirements.txt (1)

73-73: Align docs/examples to mistral-common==1.8.5 and standardize package name

requirements.txt uses mistral-common==1.8.5 but some docs still pin the underscored 1.8.3; update to mistral-common[...]==1.8.5 and prefer the dash form for consistency:

examples/voxtral/README.md:24 — pip3 install 'mistral_common[audio]==1.8.3'

docs/multimodal.qmd:119 — pip3 install ... 'mistral_common[audio]==1.8.3'

Note: vision docs already use 'mistral-common[opencv]==1.8.5' (examples/magistral/vision/README.md:14, docs/multimodal.qmd:99/109).

src/axolotl/utils/mistral/__init__.py (1)

3-6: Export OK — no import-cycle risk detected

Public export of Mistral3Processor is fine. mistral3_processor.py imports the tokenizer via its submodule (axolotl.utils.mistral.mistral_tokenizer) so there are no package-root imports inside the mistral submodules. Consumers that import from the package root: src/axolotl/loaders/processor.py, src/axolotl/loaders/tokenizer.py, src/axolotl/prompt_strategies/chat_template.py.

tests/e2e/patched/test_voxtral_modeling_patch.py (1)

9-43: LGTM: patch/unpatch integration test is correct and self-contained.

examples/mistral/mistral-small/mistral-small-3.1-24B-lora.yml (2)

4-6: LGTM: correct flag to enable native mistral tokenizer.

54-54: Confirm FA availability at runtime.

With flash_attention: true, training will fail if FA isn’t installed. Consider adding a comment pointing to installation docs or guarding via an env check in examples.

tests/e2e/patched/test_pixtral_flash_attention_patch.py (1)

22-33: LGTM: patching workflow and unpatch restoration are covered.

Good coverage of patch application, verification, and restoration of the original function.

examples/magistral/vision/magistral-small-vision-24B-qlora.yml (2)

2-6: Confirm processor selection path with tokenizer_use_mistral_common.

With processor_type: AutoProcessor plus tokenizer_use_mistral_common: true, ensure the loader selects Mistral3Processor as intended. If not, set processor_type: Mistral3Processor.

Would you like a quick repo scan script to confirm the selection logic in the loader?

35-36: Verify LoRA target regex against Magistral module names.

Ensure the regex matches actual projection param names for Magistral-Small-2509. Mismatches silently result in no trainable adapters.

I can provide a small script to list matched modules once the model is importable.

coderabbitai · 2025-09-17T17:23:01Z

+        items_to_import = []
+        for item in dir(module):
+            if item in patched_source and not item.startswith("_"):
+                items_to_import.append(item)
+
+        # Execute imports in global scope
+        if items_to_import:
+            exec(  # nosec B102
+                f"from {module_name} import ({', '.join(items_to_import)})",
+                globals(),
+            )
+


🛠️ Refactor suggestion

Eliminate broad exec usage; build an isolated namespace and fetch the patched fn explicitly.

Current approach exec-imports many symbols into globals and triggers F821 on patched_apply_chat_template. Use a local ns populated from the target module (and a couple of explicit deps), exec the patched source into it, then bind from ns. This removes multiple S102 hits and fixes the undefined-name warning.

Apply:

- # Detect what needs to be imported - items_to_import = [] - for item in dir(module): - if item in patched_source and not item.startswith("_"): - items_to_import.append(item) - - # Execute imports in global scope - if items_to_import: - exec( # nosec B102 - f"from {module_name} import ({', '.join(items_to_import)})", - globals(), - ) - - # Also need standard imports that might be used - exec("import numpy as np", globals()) # nosec B102 - exec("import torch", globals()) # nosec B102 - exec("from typing import Union, Optional, List, Dict, Any, Callable", globals()) # nosec B102 - exec("from pathlib import Path", globals()) # nosec B102 - - # Import other dependencies that might be needed - try: - exec("from transformers.utils import is_torch_available", globals()) # nosec B102 - exec( - "from transformers.tokenization_utils_base import BatchEncoding, PaddingStrategy, TensorType", - globals(), - ) # nosec B102 - exec("from transformers.utils import logging", globals()) # nosec B102 - exec("logger = logging.get_logger(__name__)", globals()) # nosec B102 - except ImportError as e: - LOG.warning(f"Could not import some dependencies: {e}") + # Build an isolated namespace seeded from the tokenizer module + import numpy as np # local import to avoid polluting globals + import torch # local import to avoid polluting globals + ns: dict = dict(vars(module)) # includes imports used by the target function + ns.update({"np": np, "torch": torch}) + + # Minimal typing shims in case the source has bare aliases + import typing as _t + from pathlib import Path as _Path + ns.update( + { + "Union": _t.Union, + "Optional": _t.Optional, + "List": _t.List, + "Dict": _t.Dict, + "Any": _t.Any, + "Callable": _t.Callable, + "Path": _Path, + } + ) # Execute the patched source - exec(patched_source, globals()) # nosec B102 + exec(patched_source, ns) # nosec B102 # Replace the method - MistralCommonTokenizer.apply_chat_template = patched_apply_chat_template + MistralCommonTokenizer.apply_chat_template = ns["patched_apply_chat_template"] LOG.info("Successfully applied MistralCommonTokenizer tensor conversion patch")

Also applies to: 61-76, 79-83

🧰 Tools

🪛 Ruff (0.12.2)

55-55: Use of exec detected

(S102)

🤖 Prompt for AI Agents

In src/axolotl/monkeypatch/models/mistral3/mistral_common_tokenizer.py around lines 48-59 (and similarly for ranges 61-76 and 79-83), replace the broad exec that imports many symbols into globals with an isolated namespace approach: create a local dict ns, populate it by importing only the explicit dependencies from the target module (e.g., any small helpers/constants used by the patched source) into ns, exec the patched_source into ns (no global injection), then explicitly fetch the patched function(s) like patched_apply_chat_template = ns["patched_apply_chat_template"] and bind them at module level as needed; this avoids exec into globals, prevents F821 undefined-name and S102 broad exec problems, and keeps imports limited and explicit.

coderabbitai · 2025-09-17T17:23:01Z

+        out = self.tokenizer(text, **text_kwargs)
+        return BatchFeature(
+            data=out, tensor_type=common_kwargs.pop("return_tensors", None)
+        )


⚠️ Potential issue

Pass return_tensors/return_dict through to the tokenizer; BatchFeature won’t coerce on its own.

Right now self.tokenizer(...) is called without tensorization flags, so you wrap lists in BatchFeature. Forward the flags and return the tokenizer’s BatchFeature directly.

Apply:

- out = self.tokenizer(text, **text_kwargs) - return BatchFeature( - data=out, tensor_type=common_kwargs.pop("return_tensors", None) - ) + return_tensors = common_kwargs.get("return_tensors", "pt") + return_dict = common_kwargs.get("return_dict", True) + out = self.tokenizer(text, return_tensors=return_tensors, return_dict=return_dict, **text_kwargs) + # If tokenizer already returns BatchFeature/BatchEncoding, just return it + if isinstance(out, BatchFeature): + return out + return BatchFeature(data=out, tensor_type=return_tensors)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

out = self.tokenizer(text, **text_kwargs)

return BatchFeature(

data=out, tensor_type=common_kwargs.pop("return_tensors", None)

)

return_tensors = common_kwargs.get("return_tensors", "pt")

return_dict = common_kwargs.get("return_dict", True)

out = self.tokenizer(text, return_tensors=return_tensors, return_dict=return_dict, **text_kwargs)

# If tokenizer already returns BatchFeature/BatchEncoding, just return it

if isinstance(out, BatchFeature):

return out

return BatchFeature(data=out, tensor_type=return_tensors)

🤖 Prompt for AI Agents

In src/axolotl/utils/mistral/mistral3_processor.py around lines 166–169, the tokenizer is called without passing tensorization flags and then wrapped into a new BatchFeature; instead forward return_tensors and return_dict from common_kwargs into the tokenizer call (do not pop/mutate them) so the tokenizer returns a proper BatchFeature/tensors, and return that tokenizer result directly rather than constructing a new BatchFeature.

coderabbitai · 2025-09-17T17:23:01Z

+        # Test with 1D position_ids (should work after patch)
+        position_ids_1d = torch.tensor([0, 1, 2, 3])
+        result = patched_fn(position_ids_1d, batch_size=1)
+        assert isinstance(result, bool), "Function should return a boolean"
+


⚠️ Potential issue

Test expects a Python bool but the patch returns a BoolTensor. Align on bool.

fixed_is_packed_sequence currently returns a torch.bool tensor when batch_size == 1. Upstream semantics should be a Python bool. Prefer fixing the patch to return bool(...) (via .item()), else relax the test.

Two options:

Preferred (patch): in src/axolotl/monkeypatch/models/pixtral/modeling_flash_attention_utils.py, change the last line to:

- return ( - batch_size == 1 - and (increasing_position_sequences - position_ids).abs().sum().bool() - ) + is_packed = (increasing_position_sequences - position_ids).abs().sum().ne(0) + return bool(batch_size == 1 and is_packed.item())

Or (test-only fallback):

- result = patched_fn(position_ids_1d, batch_size=1) - assert isinstance(result, bool), "Function should return a boolean" + result = patched_fn(position_ids_1d, batch_size=1) + assert isinstance(result, (bool, torch.Tensor)), "Expect bool-like" + result = bool(result) # normalize for downstream assertions

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Test with 1D position_ids (should work after patch)

position_ids_1d = torch.tensor([0, 1, 2, 3])

result = patched_fn(position_ids_1d, batch_size=1)

assert isinstance(result, bool), "Function should return a boolean"

# Test with 1D position_ids (should work after patch)

position_ids_1d = torch.tensor([0, 1, 2, 3])

result = patched_fn(position_ids_1d, batch_size=1)

assert isinstance(result, (bool, torch.Tensor)), "Expect bool-like"

result = bool(result) # normalize for downstream assertions

🤖 Prompt for AI Agents

In tests/e2e/patched/test_pixtral_flash_attention_patch.py around lines 37 to 41, the test expects a Python bool but the patched function returns a torch.bool tensor when batch_size == 1; update the implementation in src/axolotl/monkeypatch/models/pixtral/modeling_flash_attention_utils.py so the final return converts any 0-dim torch.bool to a Python bool (use .item() or bool(...)) so callers receive a plain bool for single-element results; ensure non-scalar tensor returns remain unchanged.

winglian · 2025-09-18T03:33:34Z

looks like the test_pixtral_flash_attention_patch test is failing

NanoCode012 · 2025-09-18T03:34:44Z

looks like the test_pixtral_flash_attention_patch test is failing

Yep, just pushed fix for it

NanoCode012 · 2025-09-18T05:47:14Z

Fixed integration test, failing test is unrelated SP test

NanoCode012 added 14 commits September 16, 2025 13:06

feat: update mistral common

f1da5b0

feat: add mistral3processor

8f6b311

fix: loading

6112e75

fix: cast pixel_values to fp32

8177273

fix: image tensor conversion

af5333b

feat: add FA2 support for pixtral based models

8cdbd22

fix: update mistral small 3.1 to use native tokenizer

77c436e

fix: install tips

837c788

fix: improve info on sample dataset files

220676d

chore: move mistral configs into subfolders

9220b17

fix: remove unneeded patch

9b23299

fix: indent

73e83da

feat: add integration tests

b3d9207

chore: move

1ed0e0e

feat: add magistral 2509 docs and example

e518eea

NanoCode012 marked this pull request as ready for review September 17, 2025 17:13

NanoCode012 changed the title ~~Feat: add native mistral3 tokenizer support~~ Feat: add Magistral Small 2509 and native mistral3 tokenizer support Sep 17, 2025

NanoCode012 requested a review from a team September 17, 2025 17:13

coderabbitai Bot reviewed Sep 17, 2025

View reviewed changes

winglian approved these changes Sep 18, 2025

View reviewed changes

fix: convert tensor to bool

4ede6f4

feat: expand tests

630a630

NanoCode012 added the ready to merge label Sep 18, 2025

chore: move tests

e7f5d44

NanoCode012 merged commit 09959fa into main Sep 18, 2025
20 of 22 checks passed

NanoCode012 deleted the feat/mistral3 branch September 18, 2025 08:42

coderabbitai Bot mentioned this pull request Oct 30, 2025

feat: Add opt-out Telemetry #3237

Merged

coderabbitai Bot mentioned this pull request Feb 28, 2026

add: qwen 3.5 #3442

Merged

coderabbitai Bot mentioned this pull request Apr 25, 2026

feat: add multimodal continued pre-training (raw image+text) #3629

Open

4 tasks

-        out = self.tokenizer(text, **text_kwargs)
-        return BatchFeature(
-            data=out, tensor_type=common_kwargs.pop("return_tensors", None)
-        )
+        return_tensors = common_kwargs.get("return_tensors", "pt")
+        return_dict = common_kwargs.get("return_dict", True)
+        out = self.tokenizer(text, return_tensors=return_tensors, return_dict=return_dict, **text_kwargs)
+        # If tokenizer already returns BatchFeature/BatchEncoding, just return it
+        if isinstance(out, BatchFeature):
+            return out
+        return BatchFeature(data=out, tensor_type=return_tensors)

Uh oh!

Conversation

NanoCode012 commented Sep 17, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Motivation and Context

How has this been tested?

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Sep 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Pre-merge checks and finishing touches

Uh oh!

github-actions Bot commented Sep 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented Sep 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Sep 17, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Sep 17, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Sep 17, 2025

Choose a reason for hiding this comment

Uh oh!

winglian commented Sep 18, 2025

Uh oh!

NanoCode012 commented Sep 18, 2025

Uh oh!

NanoCode012 commented Sep 18, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

NanoCode012 commented Sep 17, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Sep 17, 2025 •

edited

Loading

github-actions Bot commented Sep 17, 2025 •

edited

Loading

codecov Bot commented Sep 17, 2025 •

edited

Loading