Skip to content

Feat: add Magistral Small 2509 and native mistral3 tokenizer support#3165

Merged
NanoCode012 merged 18 commits into
mainfrom
feat/mistral3
Sep 18, 2025
Merged

Feat: add Magistral Small 2509 and native mistral3 tokenizer support#3165
NanoCode012 merged 18 commits into
mainfrom
feat/mistral3

Conversation

@NanoCode012
Copy link
Copy Markdown
Collaborator

@NanoCode012 NanoCode012 commented Sep 17, 2025

Description

We previously pointed users to use HF tokenizer for the mistral-small-3.1 series. This PR updates it to use their native mistral tokenizer instead. This should be more accurate.

This also adds support for the new Magistral Small 2509 model!

Additional changes:

  • added pixtral FA support
  • fixed mistral tokenizer image cast to tensor
  • improved doc on using demo dataset
  • bumped mistral-common to 1.8.5
  • removed unneeded mistral tokenizer patch
  • moved around old mistral configs.

Follow up PR:

  • Patch Mistral Tokenizer to support image key.

Motivation and Context

How has this been tested?

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

  • New Features

    • Added support and example config for Magistral Small-2509 (vision).
    • Introduced Mistral3 processor and processing strategy; optional mistral-common tokenizer flag.
    • Enabled Flash Attention in example configs for faster training.
  • Bug Fixes

    • Improved image tensor handling in tokenizer.
    • Fixed packed-sequence detection for Flash Attention with 1D position ids.
  • Documentation

    • New vision and thinking fine-tuning guides; updated Magistral overview.
    • Updated Mistral-Small-3.1 and Voxtral/Gemma examples with dataset download steps and setup tips.
  • Tests

    • Added integration tests for tokenizer and Flash Attention patches.
  • Chores

    • Bumped mistral-common to 1.8.5.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Sep 17, 2025

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

Adds Mistral/Magistral multimodal support and docs: introduces a Mistral3 processor, processing strategy, and runtime patches (tokenizer image handling and Flash Attention utils). Updates examples, configs, and docs for Magistral Small (including 2509 Vision), adjusts tokenizer/loader flows, bumps mistral-common to 1.8.5, and adds integration tests for patches.

Changes

Cohort / File(s) Summary
Docs: Multimodal page
docs/multimodal.qmd
Adds Magistral-Small-2509 entry and section; inserts vision library install tip; replaces chat_template lines with base_model YAML; minor Mistral-Small-3.1 callout.
Examples: README updates
examples/gemma3n/README.md, examples/magistral/README.md, examples/voxtral/README.md
Adds pre-finetuning dataset download steps; refocuses Magistral docs on Vision (mentions 2506/2507/2509), links to thinking/vision guides, updates getting started steps and tips.
Examples: New guides
examples/magistral/think/README.md, examples/magistral/vision/README.md
Adds dedicated fine-tuning guides for Magistral Small 2507 (thinking) and 2509 (vision), including prerequisites, commands, dataset formats, and constraints.
Examples: Vision config (new)
examples/magistral/vision/magistral-small-vision-24B-qlora.yml
Adds 24B QLoRA config for Magistral-Small-2509 VLM with CutCrossEntropy, 4-bit, LORA targets, dataset and training settings, and tokenizer_use_mistral_common via processor.
Examples: Config updates
examples/mistral/mistral-small/mistral-small-3.1-24B-lora.yml, examples/pixtral/lora-12b.yml
Switches to flash_attention: true; removes sdp_attention; updates dataset to Nanobit/text-vision-2k-test; adds tokenizer_use_mistral_common: true; removes top-level chat_template.
Dependency bump
requirements.txt
Updates mistral-common from 1.8.3 to 1.8.5.
Loaders: patching and processor flow
src/axolotl/loaders/patch_manager.py, src/axolotl/loaders/processor.py, src/axolotl/loaders/tokenizer.py
Adds conditional runtime patches for Mistral3 tokenizer and Flash Attention; introduces early-return path for Mistral3Processor when tokenizer_use_mistral_common is true; simplifies MistralCommon tokenizer load (removes internal transformer patch).
Monkeypatches (new)
src/axolotl/monkeypatch/models/mistral3/mistral_common_tokenizer.py, src/axolotl/monkeypatch/models/pixtral/modeling_flash_attention_utils.py
Adds tokenizer patch to optimize image tensor creation in apply_chat_template; adds patch to handle 1D position_ids in HF Flash Attention utils with unpatch support.
Processing strategy and utilities
src/axolotl/processing_strategies.py, src/axolotl/utils/mistral/__init__.py, src/axolotl/utils/mistral/mistral3_processor.py
Introduces Mistral3ProcessingStrategy for image-token masking; exports Mistral3Processor; adds new processor implementing image-aware apply_chat_template and HF-compatible API.
Tests: integration for patches
tests/e2e/patched/test_mistral_tokenizer_patch.py, tests/e2e/patched/test_pixtral_flash_attention_patch.py, tests/e2e/patched/test_voxtral_modeling_patch.py
Adds tests to validate tokenizer image patch, Flash Attention patch (with unpatch), and Voxtral forward patch application/restoration.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested labels

ready to merge, scheduled_release

Suggested reviewers

  • winglian
  • djsaunde

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 72.73% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title "Feat: add Magistral Small 2509 and native mistral3 tokenizer support" is concise and accurately highlights the primary changes in the PR—adding Magistral-Small-2509 and introducing native Mistral3 tokenizer/processor support—so it is directly related to the changes in the diff (docs, processor/tokenizer, and related patches). It is specific enough for a quick scan and avoids noisy details while remaining short and clear.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@NanoCode012 NanoCode012 marked this pull request as ready for review September 17, 2025 17:13
@NanoCode012 NanoCode012 changed the title Feat: add native mistral3 tokenizer support Feat: add Magistral Small 2509 and native mistral3 tokenizer support Sep 17, 2025
@NanoCode012 NanoCode012 requested a review from a team September 17, 2025 17:13
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Sep 17, 2025

📖 Documentation Preview: https://68cb9187a95507bda40326d5--resonant-treacle-0fd729.netlify.app

Deployed on Netlify from commit e7f5d44

@codecov
Copy link
Copy Markdown

codecov Bot commented Sep 17, 2025

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (23)
examples/pixtral/lora-12b.yml (1)

48-48: Flash Attention default: ensure availability/fallback

Setting flash_attention: true will hard-require FA2. If FA isn’t installed or the patch fails, training will crash. Consider documenting a fallback (e.g., keep sdp_attention: true in commented form) and reference the new Pixtral FA patch prerequisite.

examples/gemma3n/README.md (1)

26-33: Clarify download location for sample assets

The wget commands drop files into CWD; some users will run from repo root and later configs won’t find them. Add a target dir (e.g., data/) or a note to place them where the config expects.

- wget https://.../African_elephant.jpg
- wget https://.../En-us-African_elephant.oga
+ mkdir -p data
+ wget -P data https://.../African_elephant.jpg
+ wget -P data https://.../En-us-African_elephant.oga
examples/voxtral/README.md (1)

30-38: Pin mistral_common[audio] to 1.8.5 to match requirements

Docs install mistral_common[audio]==1.8.3 while the project now depends on 1.8.5. Update to avoid version skew during finetuning.

Apply:

- pip3 install 'mistral_common[audio]==1.8.3'
+ pip3 install 'mistral_common[audio]==1.8.5'
examples/magistral/README.md (1)

65-67: Call out how to enable mistral-common path in configs

Limitations mention mistral-common only; add a short pointer to set tokenizer_use_mistral_common: true (as in the example config) to reduce user confusion migrating from HF tokenizer.

-We only support the `mistral-common` tokenizer for Supervised Fine-tuning at the moment and for `type: chat_template` only.
+We only support the `mistral-common` tokenizer for Supervised Fine-tuning (set `tokenizer_use_mistral_common: true` in your config) and for `type: chat_template` only.
src/axolotl/processing_strategies.py (2)

425-453: Guard deep attribute access for special_ids

processor.tokenizer.tokenizer.instruct_tokenizer.image_encoder.special_ids is brittle; add a clear error if structure changes to avoid AttributeError at runtime.

-        special_ids = (
-            processor.tokenizer.tokenizer.instruct_tokenizer.image_encoder.special_ids
-        )
+        try:
+            special_ids = (
+                processor.tokenizer.tokenizer.instruct_tokenizer.image_encoder.special_ids
+            )
+        except AttributeError as e:
+            raise RuntimeError(
+                "Mistral3ProcessingStrategy: unable to resolve image special_ids "
+                "(expected tokenizer.tokenizer.instruct_tokenizer.image_encoder.special_ids). "
+                "Ensure mistral-common==1.8.5 and the Mistral tokenizer is selected."
+            ) from e

469-473: Avoid passing dummy chat_template for Mistral3

HFMistralTokenizer.chat_template returns a dummy string; skip auto-wiring chat_template when using Mistral3 to prevent accidental use downstream.

-    if chat_template_type in [None, "tokenizer_default"] and hasattr(
-        processor.tokenizer, "chat_template"
-    ):
-        processing_kwargs["chat_template"] = processor.tokenizer.chat_template
+    if (
+        chat_template_type in [None, "tokenizer_default"]
+        and hasattr(processor.tokenizer, "chat_template")
+        and not isinstance(processor, Mistral3Processor)
+    ):
+        processing_kwargs["chat_template"] = processor.tokenizer.chat_template

Also applies to: 497-501

src/axolotl/loaders/patch_manager.py (1)

171-177: Gate mistral tokenizer patch on the correct flag, not processor_type.

Relying on cfg.processor_type can silently skip the patch when users don’t set it. The patch is needed only when using the native mistral-common tokenizer path; gate on cfg.tokenizer_use_mistral_common instead.

Apply:

-        if self.cfg.model_config_type == "mistral3" and self.cfg.processor_type:
+        if self.cfg.model_config_type == "mistral3" and getattr(self.cfg, "tokenizer_use_mistral_common", False):
             from axolotl.monkeypatch.models.mistral3.mistral_common_tokenizer import (
                 apply_mistral_tokenizer_image_patch,
             )

             apply_mistral_tokenizer_image_patch()
docs/multimodal.qmd (2)

98-105: Call out the config flag required for native Mistral tokenizer.

Users need tokenizer_use_mistral_common: true to activate this path. Add a one-liner to prevent confusion.

Apply:

 ::: {.callout-tip}
 Please make sure to install vision lib via `pip install 'mistral-common[opencv]==1.8.5'`
 :::
 
 ```yaml
 base_model: mistralai/Mistral-Small-3.1-24B-Instruct-2503

+Note: enable the native tokenizer path with tokenizer_use_mistral_common: true in your training config.


---

`106-115`: **Mirror the same native-tokenizer note for Magistral.**

Keep guidance consistent across models that use the native mistral tokenizer.

Apply:

```diff
 ::: {.callout-tip}
 Please make sure to install vision lib via `pip install 'mistral-common[opencv]==1.8.5'`
 :::
 
 ```yaml
 base_model: mistralai/Magistral-Small-2509

+Note: set tokenizer_use_mistral_common: true in your config to use the native tokenizer with image support.


</blockquote></details>
<details>
<summary>src/axolotl/loaders/processor.py (1)</summary><blockquote>

`24-29`: **Improve error message when the native processor is unavailable.**

Wrap the import to give users a clear action (install mistral-common with opencv).

Apply:

```diff
-    if cfg.tokenizer_use_mistral_common:
-        from axolotl.utils.mistral import Mistral3Processor
+    if cfg.tokenizer_use_mistral_common:
+        try:
+            from axolotl.utils.mistral import Mistral3Processor
+        except ImportError as e:
+            raise ImportError(
+                "Mistral3Processor not found. Please install vision deps: "
+                "pip install 'mistral-common[opencv]==1.8.5'"
+            ) from e
 
         return Mistral3Processor(
             tokenizer=tokenizer,
         )
examples/magistral/vision/README.md (2)

12-16: Mention the config flag needed to activate the native tokenizer.

Add a short note so users don’t miss tokenizer_use_mistral_common: true.

Apply:

 1. Install the required vision lib:
     ```bash
     pip install 'mistral-common[opencv]==1.8.5'
     ```
+
+Also ensure your YAML sets `tokenizer_use_mistral_common: true`.

42-43: Clarify unsupported image inputs.

Explicitly list the supported keys to avoid ambiguity.

Apply:

-One exception is that, passing `"image": PIL.Image` is not supported. MistralTokenizer only supports `path`, `url`, and `base64` for now.
+Note: passing `"image": PIL.Image` is not supported. The Mistral tokenizer currently supports only `"path"`, `"url"`, and `"base64"` for image inputs.
tests/e2e/patched/test_mistral_tokenizer_patch.py (1)

24-35: Restore original method after the test to avoid cross-test side effects.

The patch persists process-wide; restore it in a finally block for isolation.

Apply:

         # Apply patch
-        apply_mistral_tokenizer_image_patch()
+        apply_mistral_tokenizer_image_patch()
 
-        # Verify patch was applied
-        assert (
-            MistralCommonTokenizer.apply_chat_template != original_apply_chat_template
-        ), "apply_chat_template was not patched"
-
-        # Verify the method is still callable
-        assert callable(MistralCommonTokenizer.apply_chat_template), (
-            "Patched method is not callable"
-        )
+        try:
+            # Verify patch was applied
+            assert (
+                MistralCommonTokenizer.apply_chat_template != original_apply_chat_template
+            ), "apply_chat_template was not patched"
+
+            # Verify the method is still callable
+            assert callable(MistralCommonTokenizer.apply_chat_template), (
+                "Patched method is not callable"
+            )
+        finally:
+            # Restore
+            MistralCommonTokenizer.apply_chat_template = original_apply_chat_template
examples/mistral/mistral-small/mistral-small-3.1-24B-lora.yml (1)

14-18: Minor: consider adding an inline note about pre-downloading images.

Good hint; optionally add a comment that absolute paths are recommended to avoid CWD issues.

src/axolotl/monkeypatch/models/pixtral/modeling_flash_attention_utils.py (1)

27-30: Return a Python bool, not a torch.bool tensor.

The current expression can yield a 0-dim tensor; be explicit to avoid truth-value ambiguities.

Apply:

-        return (
-            batch_size == 1
-            and (increasing_position_sequences - position_ids).abs().sum().bool()
-        )
+        return (batch_size == 1) and (increasing_position_sequences != position_ids).any().item()
src/axolotl/monkeypatch/models/mistral3/mistral_common_tokenizer.py (2)

23-31: Make the match-and-replace resilient to whitespace and upstream changes.

Exact string match on an indented line is brittle. Prefer a regex that preserves indentation, or match both torch.tensor(images) and torch.as_tensor(images) variants.

Apply:

+    import re
@@
-    original_tensor_conversion = (
-        "                    pixel_values = torch.tensor(images)"
-    )
+    pattern = re.compile(r"(?m)^(?P<indent>\s*)pixel_values\s*=\s*torch\.(?:as_)?tensor\(\s*images\s*\)")
@@
-    if original_tensor_conversion in original_source:
-        patched_source = original_source.replace(
-            original_tensor_conversion, patched_tensor_conversion
-        )
+    if pattern.search(original_source):
+        patched_source = pattern.sub(patched_tensor_conversion, original_source)

And keep patched_tensor_conversion using \g<indent>-aware formatting if you adopt the regex. I can provide that variant if desired.

Also applies to: 33-41


81-85: Return an unpatch callable for symmetry with other patches.

Other patches (e.g., Pixtral FA) expose unpatch(). Returning one here helps tests and controlled rollback.

Apply:

-        MistralCommonTokenizer.apply_chat_template = ns["patched_apply_chat_template"]
-        LOG.info("Successfully applied MistralCommonTokenizer tensor conversion patch")
+        old = getattr(MistralCommonTokenizer, "apply_chat_template")
+        MistralCommonTokenizer.apply_chat_template = ns["patched_apply_chat_template"]
+        LOG.info("Successfully applied MistralCommonTokenizer tensor conversion patch")
+        def _unpatch():
+            MistralCommonTokenizer.apply_chat_template = old
+        return _unpatch

Also document the return type in the docstring.

examples/magistral/vision/magistral-small-vision-24B-qlora.yml (2)

50-52: Avoid null fp16: make intent explicit.

Bare fp16: parses as null in YAML and may break boolean handling. Set it explicitly or remove.

Apply:

-bf16: true
-fp16:
+bf16: true
+fp16: false

17-25: Dataset prep flags can cause runtime surprises.

skip_prepare_dataset: true with dataset_prepared_path: last_run_prepared requires that directory to exist and match this config’s schema. Document this or default to false for out‑of‑the‑box runs.

src/axolotl/utils/mistral/mistral3_processor.py (4)

15-24: Annotate class defaults as ClassVar to satisfy Ruff and intent.

_defaults is a class-level constant; mark it as ClassVar[...].

Apply:

-from typing import Any, Dict, Optional, Union
+from typing import Any, Dict, Optional, Union, ClassVar
@@
-class Mistral3ProcessorKwargs(ProcessingKwargs):
-    _defaults: Dict[str, Dict[str, Any]] = {
+class Mistral3ProcessorKwargs(ProcessingKwargs):
+    _defaults: ClassVar[Dict[str, Dict[str, Any]]] = {

33-35: Annotate ProcessorMixin attributes as ClassVar.

Aligns with Transformers’ expectations and quiets RUF012.

Apply:

-    attributes = ["tokenizer"]
-    tokenizer_class = "HFMistralTokenizer"
+    attributes: ClassVar[list[str]] = ["tokenizer"]
+    tokenizer_class: ClassVar[str] = "HFMistralTokenizer"

120-141: Ensure image_sizes device matches pixel_values; reuse casted tensor.

Create image_sizes on the same device and read shape from the casted tensor to avoid surprises.

Apply:

-                if "pixel_values" in data:
-                    pixel_values = data["pixel_values"]
+                if "pixel_values" in data:
+                    pixel_values = data["pixel_values"]
@@
-                    data["pixel_values"] = pixel_values.to(dtype=torch.float32)
+                    pixel_values = pixel_values.to(dtype=torch.float32)
+                    data["pixel_values"] = pixel_values
@@
-                    batch_size = pixel_values.shape[0]
-                    image_sizes = torch.tensor([pixel_values.shape[-2:]] * batch_size)
+                    batch_size = pixel_values.shape[0]
+                    hw = pixel_values.shape[-2:]
+                    image_sizes = torch.tensor([hw] * batch_size, device=pixel_values.device)
                     data["image_sizes"] = image_sizes

98-108: Batched detection: tighten the check or add a fast path for list-of-dicts.

hasattr(..., "content") handles message objects, but a simpler and clearer test is to detect list-of-list vs list-of-dict explicitly.

Apply:

-        if isinstance(conversation, (list, tuple)) and (
-            isinstance(conversation[0], (list, tuple))
-            or hasattr(conversation[0], "content")
-        ):
+        if isinstance(conversation, (list, tuple)) and isinstance(conversation[0], (list, tuple)):
             is_batched = True
             conversations = conversation
         else:

If you do need to support message objects, add an explicit branch for that type.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e5c427f and e518eea.

📒 Files selected for processing (21)
  • docs/multimodal.qmd (2 hunks)
  • examples/gemma3n/README.md (1 hunks)
  • examples/magistral/README.md (3 hunks)
  • examples/magistral/think/README.md (1 hunks)
  • examples/magistral/vision/README.md (1 hunks)
  • examples/magistral/vision/magistral-small-vision-24B-qlora.yml (1 hunks)
  • examples/mistral/mistral-small/mistral-small-3.1-24B-lora.yml (2 hunks)
  • examples/pixtral/lora-12b.yml (1 hunks)
  • examples/voxtral/README.md (1 hunks)
  • requirements.txt (1 hunks)
  • src/axolotl/loaders/patch_manager.py (2 hunks)
  • src/axolotl/loaders/processor.py (1 hunks)
  • src/axolotl/loaders/tokenizer.py (0 hunks)
  • src/axolotl/monkeypatch/models/mistral3/mistral_common_tokenizer.py (1 hunks)
  • src/axolotl/monkeypatch/models/pixtral/modeling_flash_attention_utils.py (1 hunks)
  • src/axolotl/processing_strategies.py (3 hunks)
  • src/axolotl/utils/mistral/__init__.py (1 hunks)
  • src/axolotl/utils/mistral/mistral3_processor.py (1 hunks)
  • tests/e2e/patched/test_mistral_tokenizer_patch.py (1 hunks)
  • tests/e2e/patched/test_pixtral_flash_attention_patch.py (1 hunks)
  • tests/e2e/patched/test_voxtral_modeling_patch.py (1 hunks)
💤 Files with no reviewable changes (1)
  • src/axolotl/loaders/tokenizer.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-08T07:22:40.131Z
Learnt from: winglian
PR: axolotl-ai-cloud/axolotl#3038
File: examples/slurm/axolotl.slurm:16-16
Timestamp: 2025-08-08T07:22:40.131Z
Learning: In Axolotl (PR #3038), the preprocess codepath sets AXOLOTL_IS_PREPROCESS internally, so external scripts (e.g., examples/slurm/axolotl.slurm) need not export it for the early-return in src/axolotl/utils/data/sft.py to trigger.

Applied to files:

  • src/axolotl/loaders/processor.py
🧬 Code graph analysis (9)
tests/e2e/patched/test_mistral_tokenizer_patch.py (1)
src/axolotl/monkeypatch/models/mistral3/mistral_common_tokenizer.py (1)
  • apply_mistral_tokenizer_image_patch (14-85)
src/axolotl/loaders/patch_manager.py (2)
src/axolotl/monkeypatch/models/mistral3/mistral_common_tokenizer.py (1)
  • apply_mistral_tokenizer_image_patch (14-85)
src/axolotl/monkeypatch/models/pixtral/modeling_flash_attention_utils.py (1)
  • apply_patch_is_packed_sequence (6-42)
tests/e2e/patched/test_voxtral_modeling_patch.py (1)
src/axolotl/monkeypatch/models/voxtral/modeling.py (1)
  • patch_voxtral_conditional_generation_forward (10-67)
src/axolotl/loaders/processor.py (1)
src/axolotl/utils/mistral/mistral3_processor.py (1)
  • Mistral3Processor (27-169)
src/axolotl/utils/mistral/__init__.py (2)
src/axolotl/utils/mistral/mistral3_processor.py (1)
  • Mistral3Processor (27-169)
src/axolotl/utils/mistral/mistral_tokenizer.py (1)
  • HFMistralTokenizer (14-220)
src/axolotl/utils/mistral/mistral3_processor.py (1)
src/axolotl/utils/mistral/mistral_tokenizer.py (1)
  • HFMistralTokenizer (14-220)
src/axolotl/monkeypatch/models/mistral3/mistral_common_tokenizer.py (2)
src/axolotl/monkeypatch/utils.py (1)
  • detab_code (232-238)
src/axolotl/utils/logging.py (1)
  • get_logger (42-49)
tests/e2e/patched/test_pixtral_flash_attention_patch.py (1)
src/axolotl/monkeypatch/models/pixtral/modeling_flash_attention_utils.py (1)
  • apply_patch_is_packed_sequence (6-42)
src/axolotl/processing_strategies.py (2)
src/axolotl/utils/mistral/mistral3_processor.py (2)
  • Mistral3Processor (27-169)
  • chat_template (41-43)
src/axolotl/utils/mistral/mistral_tokenizer.py (1)
  • chat_template (41-43)
🪛 Ruff (0.12.2)
src/axolotl/utils/mistral/mistral3_processor.py

15-24: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


33-33: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


91-93: Avoid specifying long messages outside the exception class

(TRY003)


128-128: Avoid specifying long messages outside the exception class

(TRY003)

src/axolotl/monkeypatch/models/mistral3/mistral_common_tokenizer.py

55-55: Use of exec detected

(S102)


61-61: Use of exec detected

(S102)


62-62: Use of exec detected

(S102)


63-63: Use of exec detected

(S102)


64-64: Use of exec detected

(S102)


68-68: Use of exec detected

(S102)


69-69: Use of exec detected

(S102)


73-73: Use of exec detected

(S102)


74-74: Use of exec detected

(S102)


79-79: Use of exec detected

(S102)


82-82: Undefined name patched_apply_chat_template

(F821)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: preview
  • GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
  • GitHub Check: PyTest from Source Dist (3.11, 2.8.0)
  • GitHub Check: PyTest from Source Dist (3.11, 2.7.1)
  • GitHub Check: PyTest (3.11, 2.6.0)
  • GitHub Check: PyTest (3.11, 2.8.0)
  • GitHub Check: PyTest (3.11, 2.7.1)
🔇 Additional comments (8)
requirements.txt (1)

73-73: Align docs/examples to mistral-common==1.8.5 and standardize package name

requirements.txt uses mistral-common==1.8.5 but some docs still pin the underscored 1.8.3; update to mistral-common[...]==1.8.5 and prefer the dash form for consistency:

  • examples/voxtral/README.md:24 — pip3 install 'mistral_common[audio]==1.8.3'
  • docs/multimodal.qmd:119 — pip3 install ... 'mistral_common[audio]==1.8.3'

Note: vision docs already use 'mistral-common[opencv]==1.8.5' (examples/magistral/vision/README.md:14, docs/multimodal.qmd:99/109).

src/axolotl/utils/mistral/__init__.py (1)

3-6: Export OK — no import-cycle risk detected

Public export of Mistral3Processor is fine. mistral3_processor.py imports the tokenizer via its submodule (axolotl.utils.mistral.mistral_tokenizer) so there are no package-root imports inside the mistral submodules. Consumers that import from the package root: src/axolotl/loaders/processor.py, src/axolotl/loaders/tokenizer.py, src/axolotl/prompt_strategies/chat_template.py.

tests/e2e/patched/test_voxtral_modeling_patch.py (1)

9-43: LGTM: patch/unpatch integration test is correct and self-contained.

examples/mistral/mistral-small/mistral-small-3.1-24B-lora.yml (2)

4-6: LGTM: correct flag to enable native mistral tokenizer.


54-54: Confirm FA availability at runtime.

With flash_attention: true, training will fail if FA isn’t installed. Consider adding a comment pointing to installation docs or guarding via an env check in examples.

tests/e2e/patched/test_pixtral_flash_attention_patch.py (1)

22-33: LGTM: patching workflow and unpatch restoration are covered.

Good coverage of patch application, verification, and restoration of the original function.

examples/magistral/vision/magistral-small-vision-24B-qlora.yml (2)

2-6: Confirm processor selection path with tokenizer_use_mistral_common.

With processor_type: AutoProcessor plus tokenizer_use_mistral_common: true, ensure the loader selects Mistral3Processor as intended. If not, set processor_type: Mistral3Processor.

Would you like a quick repo scan script to confirm the selection logic in the loader?


35-36: Verify LoRA target regex against Magistral module names.

Ensure the regex matches actual projection param names for Magistral-Small-2509. Mismatches silently result in no trainable adapters.

I can provide a small script to list matched modules once the model is importable.

Comment thread examples/magistral/think/README.md
Comment thread src/axolotl/loaders/patch_manager.py
Comment on lines +48 to +59
items_to_import = []
for item in dir(module):
if item in patched_source and not item.startswith("_"):
items_to_import.append(item)

# Execute imports in global scope
if items_to_import:
exec( # nosec B102
f"from {module_name} import ({', '.join(items_to_import)})",
globals(),
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Eliminate broad exec usage; build an isolated namespace and fetch the patched fn explicitly.

Current approach exec-imports many symbols into globals and triggers F821 on patched_apply_chat_template. Use a local ns populated from the target module (and a couple of explicit deps), exec the patched source into it, then bind from ns. This removes multiple S102 hits and fixes the undefined-name warning.

Apply:

-        # Detect what needs to be imported
-        items_to_import = []
-        for item in dir(module):
-            if item in patched_source and not item.startswith("_"):
-                items_to_import.append(item)
-
-        # Execute imports in global scope
-        if items_to_import:
-            exec(  # nosec B102
-                f"from {module_name} import ({', '.join(items_to_import)})",
-                globals(),
-            )
-
-        # Also need standard imports that might be used
-        exec("import numpy as np", globals())  # nosec B102
-        exec("import torch", globals())  # nosec B102
-        exec("from typing import Union, Optional, List, Dict, Any, Callable", globals())  # nosec B102
-        exec("from pathlib import Path", globals())  # nosec B102
-
-        # Import other dependencies that might be needed
-        try:
-            exec("from transformers.utils import is_torch_available", globals())  # nosec B102
-            exec(
-                "from transformers.tokenization_utils_base import BatchEncoding, PaddingStrategy, TensorType",
-                globals(),
-            )  # nosec B102
-            exec("from transformers.utils import logging", globals())  # nosec B102
-            exec("logger = logging.get_logger(__name__)", globals())  # nosec B102
-        except ImportError as e:
-            LOG.warning(f"Could not import some dependencies: {e}")
+        # Build an isolated namespace seeded from the tokenizer module
+        import numpy as np  # local import to avoid polluting globals
+        import torch        # local import to avoid polluting globals
+        ns: dict = dict(vars(module))  # includes imports used by the target function
+        ns.update({"np": np, "torch": torch})
+
+        # Minimal typing shims in case the source has bare aliases
+        import typing as _t
+        from pathlib import Path as _Path
+        ns.update(
+            {
+                "Union": _t.Union,
+                "Optional": _t.Optional,
+                "List": _t.List,
+                "Dict": _t.Dict,
+                "Any": _t.Any,
+                "Callable": _t.Callable,
+                "Path": _Path,
+            }
+        )
 
         # Execute the patched source
-        exec(patched_source, globals())  # nosec B102
+        exec(patched_source, ns)  # nosec B102
 
         # Replace the method
-        MistralCommonTokenizer.apply_chat_template = patched_apply_chat_template
+        MistralCommonTokenizer.apply_chat_template = ns["patched_apply_chat_template"]
         LOG.info("Successfully applied MistralCommonTokenizer tensor conversion patch")

Also applies to: 61-76, 79-83

🧰 Tools
🪛 Ruff (0.12.2)

55-55: Use of exec detected

(S102)

🤖 Prompt for AI Agents
In src/axolotl/monkeypatch/models/mistral3/mistral_common_tokenizer.py around
lines 48-59 (and similarly for ranges 61-76 and 79-83), replace the broad exec
that imports many symbols into globals with an isolated namespace approach:
create a local dict ns, populate it by importing only the explicit dependencies
from the target module (e.g., any small helpers/constants used by the patched
source) into ns, exec the patched_source into ns (no global injection), then
explicitly fetch the patched function(s) like patched_apply_chat_template =
ns["patched_apply_chat_template"] and bind them at module level as needed; this
avoids exec into globals, prevents F821 undefined-name and S102 broad exec
problems, and keeps imports limited and explicit.

Comment on lines +166 to +169
out = self.tokenizer(text, **text_kwargs)
return BatchFeature(
data=out, tensor_type=common_kwargs.pop("return_tensors", None)
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Pass return_tensors/return_dict through to the tokenizer; BatchFeature won’t coerce on its own.

Right now self.tokenizer(...) is called without tensorization flags, so you wrap lists in BatchFeature. Forward the flags and return the tokenizer’s BatchFeature directly.

Apply:

-        out = self.tokenizer(text, **text_kwargs)
-        return BatchFeature(
-            data=out, tensor_type=common_kwargs.pop("return_tensors", None)
-        )
+        return_tensors = common_kwargs.get("return_tensors", "pt")
+        return_dict = common_kwargs.get("return_dict", True)
+        out = self.tokenizer(text, return_tensors=return_tensors, return_dict=return_dict, **text_kwargs)
+        # If tokenizer already returns BatchFeature/BatchEncoding, just return it
+        if isinstance(out, BatchFeature):
+            return out
+        return BatchFeature(data=out, tensor_type=return_tensors)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
out = self.tokenizer(text, **text_kwargs)
return BatchFeature(
data=out, tensor_type=common_kwargs.pop("return_tensors", None)
)
return_tensors = common_kwargs.get("return_tensors", "pt")
return_dict = common_kwargs.get("return_dict", True)
out = self.tokenizer(text, return_tensors=return_tensors, return_dict=return_dict, **text_kwargs)
# If tokenizer already returns BatchFeature/BatchEncoding, just return it
if isinstance(out, BatchFeature):
return out
return BatchFeature(data=out, tensor_type=return_tensors)
🤖 Prompt for AI Agents
In src/axolotl/utils/mistral/mistral3_processor.py around lines 166–169, the
tokenizer is called without passing tensorization flags and then wrapped into a
new BatchFeature; instead forward return_tensors and return_dict from
common_kwargs into the tokenizer call (do not pop/mutate them) so the tokenizer
returns a proper BatchFeature/tensors, and return that tokenizer result directly
rather than constructing a new BatchFeature.

Comment on lines +37 to +41
# Test with 1D position_ids (should work after patch)
position_ids_1d = torch.tensor([0, 1, 2, 3])
result = patched_fn(position_ids_1d, batch_size=1)
assert isinstance(result, bool), "Function should return a boolean"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Test expects a Python bool but the patch returns a BoolTensor. Align on bool.

fixed_is_packed_sequence currently returns a torch.bool tensor when batch_size == 1. Upstream semantics should be a Python bool. Prefer fixing the patch to return bool(...) (via .item()), else relax the test.

Two options:

  • Preferred (patch): in src/axolotl/monkeypatch/models/pixtral/modeling_flash_attention_utils.py, change the last line to:
-        return (
-            batch_size == 1
-            and (increasing_position_sequences - position_ids).abs().sum().bool()
-        )
+        is_packed = (increasing_position_sequences - position_ids).abs().sum().ne(0)
+        return bool(batch_size == 1 and is_packed.item())
  • Or (test-only fallback):
-        result = patched_fn(position_ids_1d, batch_size=1)
-        assert isinstance(result, bool), "Function should return a boolean"
+        result = patched_fn(position_ids_1d, batch_size=1)
+        assert isinstance(result, (bool, torch.Tensor)), "Expect bool-like"
+        result = bool(result)  # normalize for downstream assertions
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Test with 1D position_ids (should work after patch)
position_ids_1d = torch.tensor([0, 1, 2, 3])
result = patched_fn(position_ids_1d, batch_size=1)
assert isinstance(result, bool), "Function should return a boolean"
# Test with 1D position_ids (should work after patch)
position_ids_1d = torch.tensor([0, 1, 2, 3])
result = patched_fn(position_ids_1d, batch_size=1)
assert isinstance(result, (bool, torch.Tensor)), "Expect bool-like"
result = bool(result) # normalize for downstream assertions
🤖 Prompt for AI Agents
In tests/e2e/patched/test_pixtral_flash_attention_patch.py around lines 37 to
41, the test expects a Python bool but the patched function returns a torch.bool
tensor when batch_size == 1; update the implementation in
src/axolotl/monkeypatch/models/pixtral/modeling_flash_attention_utils.py so the
final return converts any 0-dim torch.bool to a Python bool (use .item() or
bool(...)) so callers receive a plain bool for single-element results; ensure
non-scalar tensor returns remain unchanged.

@winglian
Copy link
Copy Markdown
Collaborator

looks like the test_pixtral_flash_attention_patch test is failing

@NanoCode012
Copy link
Copy Markdown
Collaborator Author

looks like the test_pixtral_flash_attention_patch test is failing

Yep, just pushed fix for it

@NanoCode012
Copy link
Copy Markdown
Collaborator Author

Fixed integration test, failing test is unrelated SP test

@NanoCode012 NanoCode012 merged commit 09959fa into main Sep 18, 2025
20 of 22 checks passed
@NanoCode012 NanoCode012 deleted the feat/mistral3 branch September 18, 2025 08:42
@coderabbitai coderabbitai Bot mentioned this pull request Feb 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants