Feat: Add Magistral and mistral-common tokenizer support by NanoCode012 · Pull Request #2780 · axolotl-ai-cloud/axolotl

NanoCode012 · 2025-06-11T02:40:19Z

Description

Add Magistral and mistral-common package support.

Motivation and Context

How has this been tested?

Preprocess ran locally and manual training runs on cloud.

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

New Features
- Added support for the mistral-common tokenizer with new configuration options and dedicated tokenizing strategies.
- Introduced a detailed fine-tuning configuration file and README for the Magistral Small model using QLoRA.
- Added optional "mistral" dependency for easy installation of the mistral-common package.
Documentation
- Enhanced docs with new configuration references and a comprehensive Magistral Small fine-tuning guide.
Bug Fixes
- Disabled multiprocessing automatically when the tokenizer does not support it for improved stability.
Refactor
- Improved configuration validation to enforce compatibility rules for mistral-common tokenizer usage.
- Updated prompt strategy selection logic to accommodate new tokenizer options.

coderabbitai · 2025-06-11T02:40:27Z

Walkthrough

This update introduces support for the mistral-common tokenizer, including a HuggingFace-compatible wrapper, configuration options, and specialized prompt strategies. It adds validation logic, documentation, and example configuration for fine-tuning Magistral models. Multiprocessing is conditionally disabled for tokenizers that do not support it, and optional dependencies are updated to include mistral-common.

Changes

File(s)	Change Summary
docs/config.qmd	Documented new `tokenizer_use_mistral_common` config option for enabling mistral-common tokenizer.
examples/magistral/README.md	Authored detailed fine-tuning guide for Magistral Small model, including setup, usage, limitations, and future plans.
examples/magistral/magistral-small-qlora.yaml	Added QLoRA fine-tuning config for Magistral-Small with mistral-common tokenizer and dataset/training parameters.
src/axolotl/datasets.py	Disabled multiprocessing in `TokenizedPromptDataset` if tokenizer does not support it, based on a new `supports_multiprocessing` property.
src/axolotl/loaders/tokenizer.py	Added logic to select mistral-common tokenizer via a nested loader function if configured.
src/axolotl/prompt_strategies/chat_template.py	Introduced `MistralStrategy` and `MistralPrompter` for mistral-common tokenizer, updated strategy loader for config-based selection, and refactored method signatures.
src/axolotl/prompt_tokenizers.py	Added `supports_multiprocessing` property to `PromptTokenizingStrategy` base class.
src/axolotl/utils/mistral_tokenizer.py	New file: HuggingFace-compatible `HFMistralTokenizer` wrapper for mistral-common, with chat formatting, padding, and batch APIs.
src/axolotl/utils/schemas/config.py	Added validators to auto-enable mistral-common for Magistral models, check dependencies, and enforce incompatible config options.
src/axolotl/utils/schemas/model.py	Added optional `tokenizer_use_mistral_common` field to model config schema.
setup.py	Added `"mistral"` extra dependency group for `mistral-common==1.6.0`.
src/axolotl/integrations/kd/chat_template.py	Updated `_get_strategy_cls` method signature to accept `cfg` parameter for consistency with other strategy loaders.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Config
    participant Loader as TokenizerLoader
    participant Mistral as HFMistralTokenizer

    User->>Config: Set tokenizer_use_mistral_common = True
    Config->>Loader: load_tokenizer(cfg)
    alt mistral-common enabled
        Loader->>Mistral: from_pretrained(cfg.tokenizer_config)
        Mistral-->>Loader: HFMistralTokenizer instance
        Loader-->>Config: Return mistral-common tokenizer
    else default
        Loader->>Loader: Load generic tokenizer
        Loader-->>Config: Return generic tokenizer
    end

sequenceDiagram
    participant Trainer
    participant Dataset
    participant Tokenizer
    participant Strategy

    Trainer->>Dataset: process()
    Dataset->>Tokenizer: supports_multiprocessing?
    alt supports multiprocessing
        Dataset->>Strategy: Map with multiprocessing
    else does not support
        Dataset->>Strategy: Map with num_proc=1 (no multiprocessing)
    end

Possibly related PRs

axolotl-ai-cloud/axolotl#2680: Refactors model loading and tokenizer logic into src/axolotl/loaders, which is directly related to the new mistral-common tokenizer integration and loader changes in this PR.

Suggested reviewers

SalmanMohammadi

Poem

In fields of code where models dwell,
A rabbit hops with tales to tell—
Of Magistral's tokenizer, new and bright,
With chat and padding done just right.
Multiprocessing learns to pause,
While configs check for every clause.
🐇✨ Hooray for Mistral’s common cause!

✨ Finishing Touches

📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

NanoCode012 · 2025-06-11T02:52:11Z

+
+        return self._mistral.decode(tokens)
+
+    def pad(


Would love eyes on this section. I think I'm duplicating code / not as efficient padding.

stupid question: can't we call transformers code here instead of writing this ourselves? I admit I'm not super familiar with this area

I would love to if there's one. Do you know whether there's one?

transformers.PreTrainedTokenizer.pad exists but I'm not sure if / how much of the functionality overlaps with yours.

coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (12)

docs/config.qmd (1)
30-31: Clarify default value & limitations of tokenizer_use_mistral_common.

Readers don’t know whether the flag defaults to false, what side-effects exist (e.g. disables multiprocessing), or that it requires the mistral-common dep added in requirements.txt. Consider expanding the inline comment:
-# Whether to use mistral-common tokenizer. If set to True, it will use the mistral-common tokenizer.
+# Whether to use the `mistral-common` tokenizer (default: false).
+# NOTE: • Forces single-process tokenization (mistral tokenizer isn’t picklable)
+#       • Requires `mistral-common>=1.6.0`
+#       • Overrides `tokenizer_type` / `tokenizer_use_fast`
src/axolotl/utils/schemas/model.py (1)
21-21: Add schema metadata & explicit default for the new field.

tokenizer_use_mistral_common lacks the Field(..., json_schema_extra={...}) used for neighbouring attributes, and its implicit None default may complicate downstream boolean logic. Recommend:
-    tokenizer_use_mistral_common: bool | None = None
+    tokenizer_use_mistral_common: bool | None = Field(
+        default=None,
+        json_schema_extra={
+            "description": "Enable support for the mistral-common tokenizer"
+        },
+    )
This keeps schema docs consistent and makes intent explicit.
src/axolotl/datasets.py (1)
52-55: Good safeguard, but expose intent via a helper property

The guard works, but burying it inline makes future maintenance harder. Consider pushing this logic into PromptTokenizingStrategy itself:
-        if not getattr(self.prompt_tokenizer, "supports_multiprocessing", True):
-            num_proc = 1
+        if not self.prompt_tokenizer.supports_multiprocessing:
+            num_proc = 1
…and add the property with a default True in the base class. This removes the need for a getattr fallback every time the flag is checked.
src/axolotl/utils/schemas/config.py (1)
1260-1302: Validator added 🎉 – minor edge-case & phrasing nits

The field tokenizer_use_mistral_common must exist on the model. Double-check that ModelInputConfig actually defines it; otherwise Pydantic will never run this validator.

Message wording:

“mistral-common is required for mistral models” → might confuse users tuning non-Mistral models but enabling the flag. Suggest something clearer.
-                    "mistral-common is required for mistral models. Please install it with `pip install mistral-common`."
+                    "The `mistral-common` package is required when `tokenizer_use_mistral_common: true`. "
+                    "Install it with `pip install mistral-common`."
Purely cosmetic, but improves UX.
examples/magistral/README.md (1)
40-45: Minor wording / tone polish

A couple of tiny style tweaks flagged by LanguageTool:
-We only support the `mistral-common` tokenizer for Supervised Fine-tuning at the moment and for `type: chat_template` only.
+Currently, only the `mistral-common` tokenizer is supported for supervised fine-tuning with `type: chat_template`.
Not mandatory, just readability improvements.

🧰 Tools

🪛 LanguageTool

[style] ~42-~42: For conciseness, consider replacing this expression with an adverb.
Context: ...ntokenizer for Supervised Fine-tuning at the moment and fortype: chat_template` only. Th...

(AT_THE_MOMENT)
src/axolotl/prompt_strategies/chat_template.py (3)
664-666: Clean up or document commented code.

Either remove this commented debug log or add a comment explaining why it's disabled. Commented code without context becomes technical debt.
-        LOG.debug(f"Content boundaries: {start_idx}, {end_idx}")
-        # LOG.debug(
-        #     f"Content tokens: {self.tokenizer.convert_ids_to_tokens(full_ids[start_idx:end_idx])}"
-        # )
+        LOG.debug(f"Content boundaries: {start_idx}, {end_idx}")
+        # TODO: Enable token debugging when needed - currently disabled due to verbosity
+        # LOG.debug(
+        #     f"Content tokens: {self.tokenizer.convert_ids_to_tokens(full_ids[start_idx:end_idx])}"
+        # )
862-865: Address the TODO comment.

The TODO comment indicates this validation bypass is temporary. Please create a tracking issue or provide a timeline for implementing mistral-specific validation checks.

Would you like me to help design appropriate validation logic for mistral-common tokenizers or create a GitHub issue to track this technical debt?

875-881: Remove commented-out code.

This commented method appears to be unused. Please remove it to keep the codebase clean.
-    # def find_first_eos_token(self, input_ids, start_idx):
-    #     eos_token_id = self.tokenizer.instruct_tokenizer.tokenizer.eos_id
-    #     for i in range(start_idx, len(input_ids)):
-    #         if input_ids[i] == eos_token_id:
-    #             return i
-    #     return -1
-
src/axolotl/utils/mistral_tokenizer.py (4)
51-56: Use contextlib.suppress for cleaner exception handling.

Replace the try-except-pass pattern with contextlib.suppress as suggested by static analysis.
+from contextlib import suppress
+
         self._tokenizer_path = _get_file_path(path_or_repo_id, "tekken.json")
 
         # Try to load system prompt if available
-        try:
+        with suppress(FileNotFoundError):
             self._system_prompt = self._load_system_prompt(
                 path_or_repo_id=path_or_repo_id
             )
-        except FileNotFoundError:
-            pass
🧰 Tools

🪛 Ruff (0.11.9)

51-56: Use contextlib.suppress(FileNotFoundError) instead of try-except-pass

Replace with contextlib.suppress(FileNotFoundError)

(SIM105)

67-67: Document the protected member access.

Add a comment explaining why accessing the protected _special_token_policy is necessary.
         is_tekken = isinstance(tokenizer_, Tekkenizer)
         if is_tekken:
+            # We need to modify the special token policy to ensure special tokens are preserved
+            # during decoding. This is a necessary workaround for mistral-common tokenizer behavior.
             tokenizer_._special_token_policy = SpecialTokenPolicy.KEEP  # type: ignore  # pylint: disable=protected-access
470-470: Simplify dictionary membership check.

Remove unnecessary .keys() call as suggested by static analysis.
-                if key not in sequence_fields:
+                if key not in sequence_fields:
Actually, looking at the code more carefully, the current line is already correct. The issue is on line 470 which should be:
-            for key in f.keys():
+            for key in f:
🧰 Tools

🪛 Ruff (0.11.9)

470-470: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

396-397: Address or remove the TODO comment.

The TODO comment about checking if trimming is needed should be addressed or removed if not applicable.

Would you like me to help implement the trimming logic or determine if it's necessary for this use case?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 00cda8c and c4aa4f7.

📒 Files selected for processing (13)

docs/config.qmd (1 hunks)
examples/magistral/README.md (1 hunks)
examples/magistral/magistral-small-qlora.yaml (1 hunks)
requirements.txt (1 hunks)
src/axolotl/datasets.py (1 hunks)
src/axolotl/loaders/model.py (1 hunks)
src/axolotl/loaders/tokenizer.py (2 hunks)
src/axolotl/prompt_strategies/chat_template.py (8 hunks)
src/axolotl/prompt_tokenizers.py (1 hunks)
src/axolotl/utils/mistral_tokenizer.py (1 hunks)
src/axolotl/utils/schemas/config.py (1 hunks)
src/axolotl/utils/schemas/model.py (1 hunks)
src/axolotl/utils/tokenization.py (1 hunks)

🧰 Additional context used

🧬 Code Graph Analysis (3)

src/axolotl/utils/tokenization.py (1)

src/axolotl/utils/mistral_tokenizer.py (1)

decode (182-189)

src/axolotl/loaders/tokenizer.py (2)

src/axolotl/utils/dict.py (1)

DictDefault (6-38)

src/axolotl/utils/mistral_tokenizer.py (2)

HFMistralTokenizer (33-489)

from_pretrained (120-138)

src/axolotl/prompt_tokenizers.py (1)

src/axolotl/prompt_strategies/chat_template.py (1)

supports_multiprocessing (867-873)

🪛 LanguageTool

examples/magistral/README.md

[style] ~24-~24: Consider using polite language here.
Context: ...gistral/magistral-small-qlora.yaml ``` Let us know how it goes. Happy finetuning! 🚀 ### ...

(INSERT_PLEASE)

[style] ~42-~42: For conciseness, consider replacing this expression with an adverb.
Context: ...ntokenizer for Supervised Fine-tuning at the moment and fortype: chat_template` only. Th...

(AT_THE_MOMENT)

[uncategorized] ~58-~58: Possible missing comma found.
Context: ...l, etc. - Add parity to other tokenizer configs like overriding tokens.

(AI_HYDRA_LEO_MISSING_COMMA)

🪛 Ruff (0.11.9)

src/axolotl/utils/mistral_tokenizer.py

51-56: Use contextlib.suppress(FileNotFoundError) instead of try-except-pass

Replace with contextlib.suppress(FileNotFoundError)

(SIM105)

470-470: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

🔇 Additional comments (5)

requirements.txt (1)

70-72:
Details

✅ Verification successful

Check version pin & transitive-dependency compatibility for mistral-common.

mistral-common==1.6.0 is introduced without a companion upper-bound or environment marker.
Given the large set of tokenizer / transformers–adjacent deps already pinned, please verify that:

mistral-common 1.6.0 is compatible with transformers==4.52.3 shipped a few lines above.

No conflicting indirect pins are pulled in (e.g. tokenizers, sentencepiece, cattrs, etc.).

Run:

🏁 Script executed:
#!/usr/bin/env bash
# Show mistral-common requirements and spot conflicts with existing pins
python - <<'PY'
import importlib.metadata as md, json, sys, subprocess, re, textwrap, os
try:
    import pkg_resources, pathlib, sysconfig
except ImportError:
    import sys; sys.exit("setuptools not available")

req = subprocess.check_output(["python", "-m", "pip", "index", "versions", "mistral-common"])
print(req.decode())
PY
Length of output: 347

🏁 Script executed:
#!/usr/bin/env bash
# Fetch mistral-common 1.6.0 dependencies from PyPI and list pinned transformer/tokenizer-related deps
python - <<'PYTHON'
import urllib.request, json
data = json.load(urllib.request.urlopen('https://pypi.org/pypi/mistral-common/1.6.0/json'))
reqs = data.get('info', {}).get('requires_dist') or []
print("mistral-common 1.6.0 requires_dist:")
for r in reqs:
    print("  ", r)
PYTHON

echo -e "\nPinned versions in requirements.txt:"
grep -R "transformers==" -n requirements.txt || echo "  transformers not pinned"
grep -R "tokenizers==" -n requirements.txt || echo "  tokenizers not pinned"
grep -R "sentencepiece==" -n requirements.txt || echo "  sentencepiece not pinned"
grep -R "cattrs==" -n requirements.txt || echo "  cattrs not pinned"
Length of output: 3047

🏁 Script executed:
#!/usr/bin/env bash
set -e

# Clean target directory
rm -rf tmp_pkg
mkdir tmp_pkg

# Install mistral-common into a temporary directory without dependencies
pip install mistral-common==1.6.0 --no-deps --target tmp_pkg

# Find METADATA file
DISTINFO_DIR=$(find tmp_pkg -maxdepth 1 -type d -name "mistral_common-*.dist-info" | head -n 1)
METADATA_FILE="$DISTINFO_DIR/METADATA"

echo "=== mistral-common 1.6.0 Requires-Dist ==="
grep '^Requires-Dist:' "$METADATA_FILE" || echo "  (no Requires-Dist entries found)"

echo
echo "=== Pinned versions in requirements.txt ==="
grep -n "transformers==" requirements.txt || echo "  transformers not pinned"
grep -n "tokenizers==" requirements.txt || echo "  tokenizers not pinned"
grep -n "sentencepiece==" requirements.txt || echo "  sentencepiece not pinned"
grep -n "cattrs==" requirements.txt || echo "  cattrs not pinned"
Length of output: 1665
No conflicts detected for mistral-common 1.6.0

mistral-common 1.6.0 does not require transformers, so it won’t override or conflict with your pinned transformers==4.52.3. Its only overlapping requirement is:

• sentencepiece>=0.2.0 (no upper bound pin in your requirements)

It does not pull in tokenizers or cattrs. You can safely keep:
 # other dependencies
-mistral-common==1.6.0
+mistral-common==1.6.0
src/axolotl/prompt_tokenizers.py (1)

62-69: Good addition—enables runtime check for pickling support.

The default supports_multiprocessing = True and override in MistralStrategy cleanly gate multiprocessing usage. No issues spotted.

examples/magistral/magistral-small-qlora.yaml (1)

1-64: LGTM!

The configuration file is well-structured and follows best practices for QLoRA fine-tuning with appropriate hyperparameters.

src/axolotl/prompt_strategies/chat_template.py (2)

89-89: Good type hint improvement!

Making the conversation parameter type more specific improves code clarity and IDE support.

279-287: Good defensive programming!

The added check for eos_token attribute prevents potential AttributeError with tokenizers that don't have this attribute.

coderabbitai · 2025-06-11T02:55:12Z

+    def _load_mistral_common_tokenizer(cfg: DictDefault):
+        """Load mistral-common tokenizer"""
+        assert (
+            cfg.tokenizer_use_mistral_common
+        ), "tokenizer_use_mistral_common must be True"
+
+        from axolotl.utils.mistral_tokenizer import HFMistralTokenizer
+
+        # Load the HF-compatible wrapper around MistralTokenizer
+        tokenizer = HFMistralTokenizer.from_pretrained(cfg.tokenizer_config)
+
+        return tokenizer
+
+    if cfg.tokenizer_use_mistral_common:
+        return _load_mistral_common_tokenizer(cfg)
+


🛠️ Refactor suggestion

Pass-through of common tokenizer kwargs is lost

_load_mistral_common_tokenizer ignores flags that the generic path handles (e.g. trust_remote_code, revision). If the user sets them in the config they silently do nothing.

- tokenizer = HFMistralTokenizer.from_pretrained(cfg.tokenizer_config) + extra_kwargs = {} + if cfg.revision_of_model: + extra_kwargs["revision"] = cfg.revision_of_model + if cfg.trust_remote_code is not None: + extra_kwargs["trust_remote_code"] = cfg.trust_remote_code + + tokenizer = HFMistralTokenizer.from_pretrained( + cfg.tokenizer_config, + **extra_kwargs, + )

Forwarding these keeps behaviour parity with the standard HF path.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def _load_mistral_common_tokenizer(cfg: DictDefault):

"""Load mistral-common tokenizer"""

assert (

cfg.tokenizer_use_mistral_common

), "tokenizer_use_mistral_common must be True"

from axolotl.utils.mistral_tokenizer import HFMistralTokenizer

# Load the HF-compatible wrapper around MistralTokenizer

tokenizer = HFMistralTokenizer.from_pretrained(cfg.tokenizer_config)

return tokenizer

if cfg.tokenizer_use_mistral_common:

return _load_mistral_common_tokenizer(cfg)

def _load_mistral_common_tokenizer(cfg: DictDefault):

"""Load mistral-common tokenizer"""

assert (

cfg.tokenizer_use_mistral_common

), "tokenizer_use_mistral_common must be True"

from axolotl.utils.mistral_tokenizer import HFMistralTokenizer

# Load the HF-compatible wrapper around MistralTokenizer

extra_kwargs = {}

if cfg.revision_of_model:

extra_kwargs["revision"] = cfg.revision_of_model

if cfg.trust_remote_code is not None:

extra_kwargs["trust_remote_code"] = cfg.trust_remote_code

tokenizer = HFMistralTokenizer.from_pretrained(

cfg.tokenizer_config,

**extra_kwargs,

)

return tokenizer

if cfg.tokenizer_use_mistral_common:

return _load_mistral_common_tokenizer(cfg)

🤖 Prompt for AI Agents

In src/axolotl/loaders/tokenizer.py around lines 124 to 139, the _load_mistral_common_tokenizer function does not forward common tokenizer keyword arguments like trust_remote_code and revision from the config, causing them to be ignored. Modify the function to accept and pass these additional kwargs to HFMistralTokenizer.from_pretrained by extracting them from cfg or accepting **kwargs, ensuring consistent behavior with the standard HF tokenizer loading path.

coderabbitai · 2025-06-11T02:55:12Z

+        # Call the parent's parent __init__ (PromptTokenizingStrategy) to skip ChatTemplateStrategy's validation
+        # pylint: disable=non-parent-init-called
+        # pylint: disable=super-init-not-called
+        PromptTokenizingStrategy.__init__(
+            self, prompter, tokenizer, train_on_inputs, sequence_len
+        )


⚠️ Potential issue

Document the validation bypass rationale.

Bypassing the parent's _validate_eot_and_eos_tokens() validation could lead to runtime issues. Please add a detailed comment explaining why this validation doesn't apply to mistral-common tokenizers and what safeguards are in place.

# Call the parent's parent __init__ (PromptTokenizingStrategy) to skip ChatTemplateStrategy's validation # pylint: disable=non-parent-init-called # pylint: disable=super-init-not-called + # NOTE: We bypass ChatTemplateStrategy's validation because mistral-common tokenizers + # handle special tokens differently and don't expose them in the same way as HuggingFace tokenizers. + # This is safe because [explain why it's safe or what alternative validation is done]. PromptTokenizingStrategy.__init__( self, prompter, tokenizer, train_on_inputs, sequence_len )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Call the parent's parent __init__ (PromptTokenizingStrategy) to skip ChatTemplateStrategy's validation

# pylint: disable=non-parent-init-called

# pylint: disable=super-init-not-called

PromptTokenizingStrategy.__init__(

self, prompter, tokenizer, train_on_inputs, sequence_len

)

# Call the parent's parent __init__ (PromptTokenizingStrategy) to skip ChatTemplateStrategy's validation

# pylint: disable=non-parent-init-called

# pylint: disable=super-init-not-called

# NOTE: We bypass ChatTemplateStrategy's validation because mistral-common tokenizers

# handle special tokens differently and don't expose them in the same way as HuggingFace tokenizers.

# This is safe because [explain why it's safe or what alternative validation is done].

PromptTokenizingStrategy.__init__(

self, prompter, tokenizer, train_on_inputs, sequence_len

)

🤖 Prompt for AI Agents

In src/axolotl/prompt_strategies/chat_template.py around lines 827 to 832, add a detailed comment above the call to PromptTokenizingStrategy.__init__ explaining why the _validate_eot_and_eos_tokens() validation is intentionally bypassed for mistral-common tokenizers. Clarify that this validation does not apply due to specific tokenizer characteristics and describe any safeguards or alternative checks in place to prevent runtime issues, ensuring the rationale is clear for future maintainers.

codecov · 2025-06-11T03:04:36Z

Codecov Report

Attention: Patch coverage is 51.89873% with 152 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/axolotl/utils/mistral_tokenizer.py	45.81%	123 Missing ⚠️
src/axolotl/utils/schemas/config.py	46.66%	16 Missing ⚠️
src/axolotl/prompt_strategies/chat_template.py	86.66%	6 Missing ⚠️
src/axolotl/loaders/tokenizer.py	33.33%	4 Missing ⚠️
src/axolotl/datasets.py	33.33%	2 Missing ⚠️
src/axolotl/integrations/kd/chat_template.py	0.00%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

src/axolotl/utils/mistral_tokenizer.py (1)

182-189: Simplify decode method to avoid recursion.

The recursive call is unnecessary. Wrap the int in a list inline instead.

     def decode(self, ids: int | list[int], skip_special_tokens: bool = True) -> str:
         if not skip_special_tokens:
             raise NotImplementedError("skip_special_tokens not supported yet")
 
         if isinstance(ids, int):
-            return self.decode([ids])
+            ids = [ids]
 
         return self._mistral.instruct_tokenizer.tokenizer.decode(ids)

🧹 Nitpick comments (10)

src/axolotl/utils/mistral_tokenizer.py (4)
51-56: Use contextlib.suppress for cleaner exception handling.
+from contextlib import suppress
+
         # Try to load system prompt if available
-        try:
+        with suppress(FileNotFoundError):
             self._system_prompt = self._load_system_prompt(
                 path_or_repo_id=path_or_repo_id
             )
-        except FileNotFoundError:
-            pass
🧰 Tools

🪛 Ruff (0.11.9)

51-56: Use contextlib.suppress(FileNotFoundError) instead of try-except-pass

Replace with contextlib.suppress(FileNotFoundError)

(SIM105)

396-397: Address the TODO comment about trimming.

The TODO comment indicates uncertainty about whether trimming is needed and if it's correct. This should be resolved before merging.

Would you like me to help analyze the trimming logic and determine if it's needed?

470-470: Simplify dictionary membership check.
-                if key not in sequence_fields:
+                if key not in sequence_fields:
Wait, I need to look at this more carefully. The line is checking key in f.keys() according to static analysis, but the code shows it differently. Let me re-read...

Actually, looking at the code:
for key in f.keys():
    if key not in sequence_fields:
The static analysis is correct - this should be simplified.
-            for key in f.keys():
+            for key in f:
                 if key not in sequence_fields:
🧰 Tools

🪛 Ruff (0.11.9)

470-470: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

289-490: Consider refactoring the pad method for better maintainability.

This method is quite complex with over 200 lines handling various padding scenarios. Consider breaking it down into smaller helper methods for each tensor type (input_ids, labels, attention_mask, position_ids) to improve readability and maintainability.

Would you like me to help refactor this method into smaller, more focused helper functions?

🧰 Tools

🪛 Ruff (0.11.9)

470-470: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)
examples/magistral/README.md (6)
3-3: Use correct term “open-source”
For accuracy and consistency, change “opensource” to “open-source.”
- Magistral Small is a 24B parameter opensource model from…
+ Magistral Small is a 24B parameter open-source model from…
16-16: Correct casing for PyTorch
Use the official “PyTorch” spelling.
- # Ensure you have Pytorch installed (we recommend Pytorch 2.6.0)
+ # Ensure you have PyTorch installed (we recommend PyTorch 2.6.0)
36-36: Add a polite prompt
Consider adding “Please” for a more friendly tone.
- Let us know how it goes. Happy finetuning! 🚀
+ Please let us know how it goes. Happy fine-tuning! 🚀
41-41: Hyphenate “fine-tuning”
Maintain consistency by using “fine-tuning” with a hyphen.
- You can run a full finetuning by removing the `adapter: qlora`…
+ You can run a full fine-tuning by removing the `adapter: qlora`…
53-53: Replace “at the moment” with “currently”
For conciseness and clarity, use “currently.”
- We only support the `mistral-common` tokenizer for Supervised Fine-tuning at the moment…
+ We only support the `mistral-common` tokenizer for Supervised Fine-tuning currently…
🧰 Tools

🪛 LanguageTool

[style] ~53-~53: For conciseness, consider replacing this expression with an adverb.
Context: ...ntokenizer for Supervised Fine-tuning at the moment and fortype: chat_template` only. Th...

(AT_THE_MOMENT)

69-69: Insert missing comma
Add a comma after “configs” for clarity.
- Add parity to other tokenizer configs like overriding tokens.
+ Add parity to other tokenizer configs, like overriding tokens.
🧰 Tools

🪛 LanguageTool

[uncategorized] ~69-~69: Possible missing comma found.
Context: ...l, etc. - Add parity to other tokenizer configs like overriding tokens.

(AI_HYDRA_LEO_MISSING_COMMA)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c4aa4f7 and 23ec193.

📒 Files selected for processing (12)

docs/config.qmd (1 hunks)
examples/magistral/README.md (1 hunks)
examples/magistral/magistral-small-qlora.yaml (1 hunks)
setup.py (1 hunks)
src/axolotl/datasets.py (1 hunks)
src/axolotl/loaders/model.py (1 hunks)
src/axolotl/loaders/tokenizer.py (1 hunks)
src/axolotl/prompt_strategies/chat_template.py (8 hunks)
src/axolotl/prompt_tokenizers.py (1 hunks)
src/axolotl/utils/mistral_tokenizer.py (1 hunks)
src/axolotl/utils/schemas/config.py (1 hunks)
src/axolotl/utils/schemas/model.py (1 hunks)

✅ Files skipped from review due to trivial changes (3)

setup.py
docs/config.qmd
src/axolotl/loaders/model.py

🚧 Files skipped from review as they are similar to previous changes (7)

src/axolotl/prompt_tokenizers.py
src/axolotl/datasets.py
src/axolotl/utils/schemas/model.py
src/axolotl/loaders/tokenizer.py
src/axolotl/utils/schemas/config.py
examples/magistral/magistral-small-qlora.yaml
src/axolotl/prompt_strategies/chat_template.py

🧰 Additional context used

🪛 Ruff (0.11.9)

src/axolotl/utils/mistral_tokenizer.py

51-56: Use contextlib.suppress(FileNotFoundError) instead of try-except-pass

Replace with contextlib.suppress(FileNotFoundError)

(SIM105)

470-470: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

🪛 LanguageTool

examples/magistral/README.md

[style] ~35-~35: Consider using polite language here.
Context: ...gistral/magistral-small-qlora.yaml ``` Let us know how it goes. Happy finetuning! 🚀 ### ...

(INSERT_PLEASE)

[style] ~53-~53: For conciseness, consider replacing this expression with an adverb.
Context: ...ntokenizer for Supervised Fine-tuning at the moment and fortype: chat_template` only. Th...

(AT_THE_MOMENT)

[uncategorized] ~69-~69: Possible missing comma found.
Context: ...l, etc. - Add parity to other tokenizer configs like overriding tokens.

(AI_HYDRA_LEO_MISSING_COMMA)

⏰ Context from checks skipped due to timeout of 90000ms (9)

GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
GitHub Check: PyTest (3.11, 2.7.0)
GitHub Check: PyTest (3.11, 2.6.0)
GitHub Check: PyTest (3.11, 2.5.1)
GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.5.1, 2, true)
GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.6.0, vllm, 2, true)
GitHub Check: test-axolotl-multigpu (126, 12.6.3, 3.11, 2.7.0, 2, true)

🔇 Additional comments (1)

examples/magistral/README.md (1)

21-21: Verify extra dependency group name
The example uses the mistral extra, but the PR introduces a mistral-common tokenizer package. Please confirm the correct extras group in setup.py.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

src/axolotl/utils/mistral_tokenizer.py (1)

214-221: Simplify decode method to avoid recursion.

The recursive call is unnecessary. Wrap the int in a list inline instead.

     def decode(self, ids: int | list[int], skip_special_tokens: bool = True) -> str:
         if not skip_special_tokens:
             raise NotImplementedError("skip_special_tokens not supported yet")
 
         if isinstance(ids, int):
-            return self.decode([ids])
+            ids = [ids]
 
         return self._mistral.instruct_tokenizer.tokenizer.decode(ids)

🧹 Nitpick comments (4)

src/axolotl/utils/mistral_tokenizer.py (4)
69-73: Reliance on private APIs poses maintenance risk.

The code accesses private attributes _special_token_policy and _mode from the mistral-common library. These could change without notice in future versions.

Consider:

Requesting public APIs from the mistral-common maintainers for these features

Adding version pinning and tests to detect breaking changes

Documenting these dependencies clearly

Also applies to: 84-94

154-160: Consider warning about unused kwargs.

The method accepts **kwargs but silently ignores them. Users might pass unsupported parameters without realizing they're ignored.
     @classmethod
     def from_pretrained(
         cls,
         path_or_repo_id: str,
         *,
         revision: Optional[str] = None,
-        **kwargs,  # pylint: disable=unused-argument
+        **kwargs,
     ) -> "HFMistralTokenizer":
         """
         Download a mistral tokenizer from HF Hub and wrap it.
         """
+        if kwargs:
+            import warnings
+            warnings.warn(f"Ignoring unsupported arguments: {list(kwargs.keys())}")
424-428: Address the TODO comment about trimming.

The TODO comment suggests uncertainty about whether trimming is needed and if it's correct. This should be resolved before merging.

Would you like me to analyze the trimming logic and determine if it's correctly implemented?

498-498: Simplify dictionary key check.
-                if key not in sequence_fields:
+            for key in f:
+                if key not in sequence_fields:
🧰 Tools

🪛 Ruff (0.11.9)

498-498: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 23ec193 and e693b73.

📒 Files selected for processing (1)

src/axolotl/utils/mistral_tokenizer.py (1 hunks)

🧰 Additional context used

🪛 Ruff (0.11.9)

src/axolotl/utils/mistral_tokenizer.py

56-61: Use contextlib.suppress(FileNotFoundError) instead of try-except-pass

Replace with contextlib.suppress(FileNotFoundError)

(SIM105)

498-498: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

⏰ Context from checks skipped due to timeout of 90000ms (12)

GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.5.1, 2, true)
GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
GitHub Check: test-axolotl-multigpu (126, 12.6.3, 3.11, 2.7.0, 2, true)
GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.6.0, vllm, 2, true)
GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
GitHub Check: PyTest (3.11, 2.5.1)
GitHub Check: PyTest (3.11, 2.6.0)
GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
GitHub Check: PyTest (3.11, 2.7.0)
GitHub Check: pre-commit
GitHub Check: pre-commit
GitHub Check: preview

🔇 Additional comments (2)

src/axolotl/utils/mistral_tokenizer.py (2)

317-518: Padding implementation looks correct.

The padding logic properly handles:

Different padding strategies (longest, max_length, do_not_pad)

Position IDs continuation for right padding

Proper padding values for each tensor type

Conversion to requested tensor format

The implementation aligns with HuggingFace's padding behavior.

🧰 Tools

🪛 Ruff (0.11.9)

498-498: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

26-35:
Details

❌ Incorrect review comment

Consider adding validation for the filename parameter.

The function assumes filename is safe to use in path operations. Consider validating the filename to prevent potential path traversal issues.

🏁 Script executed:
#!/bin/bash
# Check if there are any other usages of _get_file_path that might pass unsafe filenames
ast-grep --pattern '_get_file_path($_, $_)'
Length of output: 382
No additional filename validation needed

The _get_file_path function is only ever called with static literals ("tekken.json", "SYSTEM_PROMPT.txt") within this module. There’s no user-controlled input flowing into the filename parameter, so path traversal isn’t a concern and no extra validation is required.

Likely an incorrect or invalid review comment.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

src/axolotl/integrations/kd/chat_template.py (2)
192-194: Add type annotation and clarify future-proofing

cfg is introduced in the signature but immediately discarded.
Giving the parameter a type and documenting the rationale will (a) silence type-checkers, (b) avoid the need for a pylint override, and (c) make it clear that branching on cfg is planned for later.
-def _get_strategy_cls(self, cfg):  # pylint: disable=unused-argument
-    return ChatTemplateStrategyWithKD
+def _get_strategy_cls(
+    self, cfg: Any  # noqa: D401 – kept for future config-based branching
+) -> type[ChatTemplateStrategyWithKD]:
+    # NOTE: For now the KD path always uses ChatTemplateStrategyWithKD.
+    # The `cfg` argument is accepted so that we can introduce
+    # conditional selection (e.g. MistralStrategyWithKD) without an
+    # interface change later.
+    return ChatTemplateStrategyWithKD
195-203: Make kd_temperature extraction resilient to non-dict config objects

cfg is treated as a mapping (cfg.get) but elsewhere in the codebase it can be a dataclass or AttrDict.
Falling back to getattr prevents an AttributeError in such cases.
-        if kd_temperature := cfg.get("kd_temperature"):
-            strategy_params["kd_temperature"] = kd_temperature
+        kd_temperature = (
+            cfg.get("kd_temperature")  # type: ignore[attr-defined]
+            if hasattr(cfg, "get")
+            else getattr(cfg, "kd_temperature", None)
+        )
+        if kd_temperature is not None:
+            strategy_params["kd_temperature"] = kd_temperature

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e693b73 and db3474e.

📒 Files selected for processing (1)

src/axolotl/integrations/kd/chat_template.py (1 hunks)

🧰 Additional context used

🧬 Code Graph Analysis (1)

src/axolotl/integrations/kd/chat_template.py (2)

src/axolotl/prompt_strategies/chat_template.py (1)

_get_strategy_cls (897-901)

src/axolotl/integrations/base.py (2)

cfg (319-320)

cfg (323-324)

⏰ Context from checks skipped due to timeout of 90000ms (11)

GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.6.0, vllm, 2, true)
GitHub Check: test-axolotl-multigpu (126, 12.6.3, 3.11, 2.7.0, 2, true)
GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.5.1, 2, true)
GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
GitHub Check: pre-commit
GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
GitHub Check: PyTest (3.11, 2.6.0)
GitHub Check: PyTest (3.11, 2.5.1)
GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
GitHub Check: PyTest (3.11, 2.7.0)
GitHub Check: preview

djsaunde

Can we add a smoke test?

Unit tests would also be nice but that could come later.

djsaunde · 2025-06-11T16:56:45Z

+        if tokenizer_use_mistral_common:
+            try:
+                import mistral_common  # noqa: F401 # pylint:disable=unused-import
+            except ImportError as exception:
+                raise ImportError(
+                    "mistral-common is required for mistral models. Please install it with `pip install mistral-common`."
+                ) from exception


Do we need this guard? I see you added mistral_common to our requirements.txt so this shouldn't get hit.

Yes, I moved it to optional dependency like you've previously done.

djsaunde · 2025-06-11T18:06:29Z

 # Whether to use the legacy tokenizer setting, defaults to True
 tokenizer_legacy:
+# Whether to use mistral-common tokenizer. If set to True, it will use the mistral-common tokenizer.
+tokenizer_use_mistral_common:


I think this is fine for now, but it might be better to auto-detect needing the mistral common tokenizer as this can be a stumbling point for users.

Yup, I agree here. I'll add a simple name check for now

djsaunde · 2025-06-11T18:10:21Z

 from axolotl.utils.schemas.datasets import DatasetConfig

+if TYPE_CHECKING:
+    from axolotl.utils.mistral_tokenizer import HFMistralTokenizer


djsaunde · 2025-06-11T18:20:02Z

+            path: The path to the tokenizer files.
+        """
+        self._mistral = mistral
+        self._padding_side = "right"


stupid question: doesn't the mistral tokenizer pad from the left?

I'm not sure about this too. For fine-tuning, it makes sense to be right? Anyways, I think the below would override it either way. Not sure if this block still needs it.

axolotl/src/axolotl/loaders/tokenizer.py

Lines 174 to 175 in 00cda8c

if cfg.is_mistral_derived_model and cfg.flash_attention and not cfg.sample_packing:

tokenizer.padding_side = "left"

djsaunde · 2025-06-11T18:21:46Z

+        # Try to load system prompt if available
+        try:
+            self._system_prompt = self._load_system_prompt(
+                path_or_repo_id=path_or_repo_id
+            )
+        except FileNotFoundError:
+            pass


You should initialize self._system_prompt as None here; otherwise, trying to access it later will cause an error.

Or, maybe it's not used anywhere?

I would recommend using contextlib.suppress instead of the try / expect block.

It's not used anywhere. I'm deleting this block but keeping the utility function for now in case we want to revisit this. I think we don't want to default add system blocks without user's knowledge

djsaunde · 2025-06-11T18:22:56Z

+from mistral_common.tokens.tokenizers.mistral import (
+    MistralTokenizer as _MistralTokenizer,
+)


Why change the name?

djsaunde · 2025-06-11T18:23:21Z

+            raise NotImplementedError("Revision not supported yet")
+
+        # check if tokenizer_config is a valid local path
+        base = _MistralTokenizer.from_file(


This class supports .from_hf_hub as well now

Yep. For now, I'll keep this for loading tekken only. I think we need a function for find_tokenizer_in_folder like they have in from_hf_hub. We need the local tokenizer path to copy over the tokenizer during save_pretrained.

djsaunde · 2025-06-11T18:33:56Z

+
+        return self._mistral.decode(tokens)
+
+    def pad(


stupid question: can't we call transformers code here instead of writing this ourselves? I admit I'm not super familiar with this area

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

src/axolotl/utils/mistral_tokenizer.py (2)

43-48: Docstring still references non-existent tokenizer parameter
This was pointed out previously and remains unchanged. Replace tokenizer with mistral for accuracy.

256-275: Recursive decode persists
The unnecessary recursion for int inputs is still present. A single-line wrap is
simpler and avoids an extra call stack frame.

🧹 Nitpick comments (1)

src/axolotl/utils/mistral_tokenizer.py (1)
550-556: Minor: iterate dict directly

for key in f.keys(): is verbose and flagged by Ruff. Iterating over the dict is clearer:
-            for key in f.keys():
+            for key in f:
🧰 Tools

🪛 Ruff (0.11.9)

552-552: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between db3474e and f62f16d.

📒 Files selected for processing (1)

src/axolotl/utils/mistral_tokenizer.py (1 hunks)

🧰 Additional context used

🪛 Ruff (0.11.9)

src/axolotl/utils/mistral_tokenizer.py

552-552: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

⏰ Context from checks skipped due to timeout of 90000ms (9)

GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
GitHub Check: PyTest (3.11, 2.7.0)
GitHub Check: PyTest (3.11, 2.5.1)
GitHub Check: PyTest (3.11, 2.6.0)
GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.6.0, vllm, 2, true)
GitHub Check: test-axolotl-multigpu (126, 12.6.3, 3.11, 2.7.0, 2, true)
GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.5.1, 2, true)

NanoCode012 · 2025-06-11T19:59:29Z

TODOs left:

Check pad handling
Unit test for Magistral tokenizer (mistral-common)
Smoke test for Magistral training

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (3)

src/axolotl/utils/mistral_tokenizer.py (3)
42-46: Fix docstring to match actual parameters.

The docstring refers to incorrect parameter names and is missing documentation for tokenizer_path.
         """
         Args:
-            mistral: The mistral-common tokenizer to wrap.
-            name_or_path: The name or path to the tokenizer files or the repo id.
+            mistral: The MistralTokenizer instance to wrap.
+            name_or_path: The path to the tokenizer files or HuggingFace repository ID.
+            tokenizer_path: The path to the tokenizer file (tekken.json).
         """
342-542: Consider leveraging transformers' padding utilities.

This is a comprehensive padding implementation, but it might be duplicating functionality already available in transformers. Have you evaluated whether transformers.PreTrainedTokenizer.pad could be used or adapted instead?

🧰 Tools

🪛 Ruff (0.11.9)

523-523: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

21-31: ⚠️ Potential issue

Critical: _get_file_path returns directory instead of file path.

When a local directory is provided, the function returns the directory path instead of the full file path, causing errors when the path is used to open or copy files.
 def _get_file_path(path_or_repo_id: str, filename: str) -> str:
     """Get the file path from local or HF Hub"""
     if os.path.exists(path_or_repo_id):
         maybe_file_path = os.path.join(path_or_repo_id, filename)
         if os.path.exists(maybe_file_path):
-            return path_or_repo_id
+            return maybe_file_path
 
         raise FileNotFoundError(f"File not found at {path_or_repo_id}")
 
     return hf_hub_download(repo_id=path_or_repo_id, filename=filename)

🧹 Nitpick comments (2)

src/axolotl/utils/schemas/config.py (1)

1271-1283: Simplify nested if statements and clarify the warning message.

The nested if statements can be combined, and the warning message should better explain why explicit setting is preferred.

-        if data.get("tokenizer_use_mistral_common") is None:
-            if any(
-                "magistral" in name.lower()
-                for name in [
-                    data.get("base_model", ""),
-                    data.get("base_model_config", ""),
-                    data.get("tokenizer_config", ""),
-                ]
-            ):
-                LOG.warning(
-                    "tokenizer_use_mistral_common auto inferred to True for Magistral models. Please set it to True explicitly if you want to use mistral-common tokenizer."
-                )
-                data["tokenizer_use_mistral_common"] = True
+        if data.get("tokenizer_use_mistral_common") is None and any(
+            "magistral" in name.lower()
+            for name in [
+                data.get("base_model", ""),
+                data.get("base_model_config", ""),
+                data.get("tokenizer_config", ""),
+            ]
+        ):
+            LOG.warning(
+                "Detected Magistral model - auto-enabling mistral-common tokenizer. "
+                "Set `tokenizer_use_mistral_common: true` explicitly in your config to remove this warning."
+            )
+            data["tokenizer_use_mistral_common"] = True

🧰 Tools

🪛 Ruff (0.11.9)

1271-1279: Use a single if statement instead of nested if statements

(SIM102)

src/axolotl/utils/mistral_tokenizer.py (1)

520-527: Simplify dictionary key iteration.

Remove unnecessary .keys() when iterating over dictionary keys.

         # Handle non-sequence fields (raise error)
         sequence_fields = {"input_ids", "labels", "attention_mask", "position_ids"}
         for f in features:
-            for key in f.keys():
+            for key in f:
                 if key not in sequence_fields:
                     raise NotImplementedError(
                         f"Non-sequence field {key} not handled yet"
                     )

🧰 Tools

🪛 Ruff (0.11.9)

523-523: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f62f16d and 6f862b6.

📒 Files selected for processing (5)

src/axolotl/datasets.py (1 hunks)
src/axolotl/loaders/tokenizer.py (1 hunks)
src/axolotl/prompt_strategies/chat_template.py (7 hunks)
src/axolotl/utils/mistral_tokenizer.py (1 hunks)
src/axolotl/utils/schemas/config.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (3)

src/axolotl/datasets.py
src/axolotl/loaders/tokenizer.py
src/axolotl/prompt_strategies/chat_template.py

🧰 Additional context used

🪛 Ruff (0.11.9)

src/axolotl/utils/mistral_tokenizer.py

523-523: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

src/axolotl/utils/schemas/config.py

1271-1279: Use a single if statement instead of nested if statements

(SIM102)

⏰ Context from checks skipped due to timeout of 90000ms (8)

GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
GitHub Check: PyTest (3.11, 2.7.0)
GitHub Check: PyTest (3.11, 2.6.0)
GitHub Check: pre-commit
GitHub Check: PyTest (3.11, 2.5.1)
GitHub Check: preview

🔇 Additional comments (2)

src/axolotl/utils/schemas/config.py (1)

1300-1328: Well-structured incompatibility checks.

The validator clearly identifies and reports incompatible options with helpful error messages.

src/axolotl/utils/mistral_tokenizer.py (1)

183-204: Excellent error handling in save_pretrained.

The method includes comprehensive error checking and provides detailed error messages that will help with debugging.

coderabbitai · 2025-06-11T20:07:34Z

+        maybe_tokenizer_path = os.path.join(name_or_path, "tekken.json")
+        if os.path.exists(maybe_tokenizer_path):
+            base = MistralTokenizer.from_file(maybe_tokenizer_path)
+        else:
+            base = MistralTokenizer.from_hf_hub(name_or_path, revision=revision)
+
+        return cls(
+            base,
+            name_or_path=name_or_path,
+            tokenizer_path=os.path.join(name_or_path, "tekken.json"),
+        )


🛠️ Refactor suggestion

Use _get_file_path helper and handle HF Hub case for tokenizer_path.

The method hardcodes the tekken.json path and assumes it's always local, which won't work correctly when loading from HuggingFace Hub.

- maybe_tokenizer_path = os.path.join(name_or_path, "tekken.json") - if os.path.exists(maybe_tokenizer_path): - base = MistralTokenizer.from_file(maybe_tokenizer_path) - else: - base = MistralTokenizer.from_hf_hub(name_or_path, revision=revision) + try: + tokenizer_path = _get_file_path(name_or_path, "tekken.json") + base = MistralTokenizer.from_file(tokenizer_path) + except FileNotFoundError: + # Fall back to HF Hub + base = MistralTokenizer.from_hf_hub(name_or_path, revision=revision) + # For HF Hub case, we'll need to download the file for save_pretrained + tokenizer_path = hf_hub_download(repo_id=name_or_path, filename="tekken.json", revision=revision) return cls( base, name_or_path=name_or_path, - tokenizer_path=os.path.join(name_or_path, "tekken.json"), + tokenizer_path=tokenizer_path, )

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents

In src/axolotl/utils/mistral_tokenizer.py lines 159 to 169, replace the hardcoded "tekken.json" path with the _get_file_path helper function to correctly resolve the tokenizer_path. Also, ensure that when loading from the HuggingFace Hub, the tokenizer_path is set appropriately to handle remote files instead of assuming a local path. This will fix issues with loading the tokenizer from the hub.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (3)

src/axolotl/utils/mistral_tokenizer.py (3)

295-313: Guard against missing description / parameters keys when building Tool objects

function["description"] will raise KeyError if the caller omits it. Better to use .get() with sane defaults, mirroring HF-style behaviour.

-        function = tool["function"]
+        function = tool["function"]
 
-        tool_calls.append(
-            Tool(
-                function=Function(
-                    name=function["name"],
-                    description=function["description"],
-                    # set parameters to empty dict if not provided
-                    parameters=function.get("parameters", {}),
-                )
-            )
-        )
+        tool_calls.append(
+            Tool(
+                function=Function(
+                    name=function["name"],
+                    description=function.get("description", ""),
+                    parameters=function.get("parameters", {}),
+                )
+            )
+        )

363-366: token_type_ids check can be simplified & faster

Ruff hint (SIM118): iterate over the dict directly rather than .keys().

-if any("token_type_ids" in f for f in features):
+if any("token_type_ids" in f for f in features):

You can drop .keys() entirely:

-if any("token_type_ids" in f.keys() for f in features):
+if any("token_type_ids" in f for f in features):

534-542: Prefer explicit integer dtype over deprecated np.long

np.long is an alias that NumPy discourages. Use np.int64 (or np.int_) for clarity and forward compatibility.

-                    result[k] = v.numpy().astype(np.long)
+                    result[k] = v.numpy().astype(np.int64)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6f862b6 and a2509fb.

📒 Files selected for processing (1)

src/axolotl/utils/mistral_tokenizer.py (1 hunks)

🧰 Additional context used

🪛 Ruff (0.11.9)

src/axolotl/utils/mistral_tokenizer.py

528-528: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

⏰ Context from checks skipped due to timeout of 90000ms (12)

GitHub Check: PyTest (3.11, 2.5.1)
GitHub Check: PyTest (3.11, 2.7.0)
GitHub Check: PyTest (3.11, 2.6.0)
GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
GitHub Check: pre-commit
GitHub Check: test-axolotl-multigpu (126, 12.6.3, 3.11, 2.7.0, 2, true)
GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.5.1, 2, true)
GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.6.0, vllm, 2, true)
GitHub Check: preview
GitHub Check: pre-commit

coderabbitai · 2025-06-11T23:20:29Z

+def _get_file_path(path_or_repo_id: str, filename: str) -> str:
+    """Get the file path from local or HF Hub"""
+    if os.path.exists(path_or_repo_id):
+        maybe_file_path = os.path.join(path_or_repo_id, filename)
+        if os.path.exists(maybe_file_path):
+            return maybe_file_path
+
+        raise FileNotFoundError(f"File not found at {path_or_repo_id}")
+
+    return hf_hub_download(repo_id=path_or_repo_id, filename=filename)
+


🛠️ Refactor suggestion

_get_file_path silently fails for direct file paths & reports confusing error paths

If the caller passes an absolute file path (.../tekken.json) this helper will still append filename, producing .../tekken.json/tekken.json and inevitably fail on the subsequent hf_hub_download.

The FileNotFoundError raised when the file is missing points to the directory (path_or_repo_id) instead of the missing file, making debugging harder.

def _get_file_path(path_or_repo_id: str, filename: str) -> str: """Get the file path from local or HF Hub""" - if os.path.exists(path_or_repo_id): - maybe_file_path = os.path.join(path_or_repo_id, filename) - if os.path.exists(maybe_file_path): - return maybe_file_path - - raise FileNotFoundError(f"File not found at {path_or_repo_id}") - - return hf_hub_download(repo_id=path_or_repo_id, filename=filename) + # ① Caller passed the file itself + if os.path.isfile(path_or_repo_id): + return path_or_repo_id + + # ② Caller passed a directory – check for <dir>/<filename> + if os.path.isdir(path_or_repo_id): + file_path = os.path.join(path_or_repo_id, filename) + if os.path.isfile(file_path): + return file_path + raise FileNotFoundError(f"Expected file '{filename}' not found at '{file_path}'") + + # ③ Fallback to HF Hub + return hf_hub_download(repo_id=path_or_repo_id, filename=filename)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def _get_file_path(path_or_repo_id: str, filename: str) -> str:

"""Get the file path from local or HF Hub"""

if os.path.exists(path_or_repo_id):

maybe_file_path = os.path.join(path_or_repo_id, filename)

if os.path.exists(maybe_file_path):

return maybe_file_path

raise FileNotFoundError(f"File not found at {path_or_repo_id}")

return hf_hub_download(repo_id=path_or_repo_id, filename=filename)

def _get_file_path(path_or_repo_id: str, filename: str) -> str:

"""Get the file path from local or HF Hub"""

# ① Caller passed the file itself

if os.path.isfile(path_or_repo_id):

return path_or_repo_id

# ② Caller passed a directory – check for <dir>/<filename>

if os.path.isdir(path_or_repo_id):

file_path = os.path.join(path_or_repo_id, filename)

if os.path.isfile(file_path):

return file_path

raise FileNotFoundError(f"Expected file '{filename}' not found at '{file_path}'")

# ③ Fallback to HF Hub

return hf_hub_download(repo_id=path_or_repo_id, filename=filename)

🤖 Prompt for AI Agents

In src/axolotl/utils/mistral_tokenizer.py around lines 21 to 31, the _get_file_path function incorrectly handles cases where path_or_repo_id is a direct file path by appending filename, causing invalid paths and failures. Also, the FileNotFoundError message references the directory instead of the missing file. Fix this by first checking if path_or_repo_id is a file path and returning it directly if so. Update the error message to include the full missing file path for clearer debugging.

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

src/axolotl/utils/mistral_tokenizer.py (1)

21-30: ⚠️ Potential issue

_get_file_path mishandles direct file paths & emits misleading errors

Passing an absolute path to tekken.json currently produces an invalid path
<file>/tekken.json and the function falls back to HF Hub.
Additionally, FileNotFoundError points to the directory, not the missing file,
which complicates debugging.

-def _get_file_path(path_or_repo_id: str, filename: str) -> str:
-    """Get the file path from local or HF Hub"""
-    if os.path.exists(path_or_repo_id):
-        maybe_file_path = os.path.join(path_or_repo_id, filename)
-        if os.path.exists(maybe_file_path):
-            return maybe_file_path
-
-        raise FileNotFoundError(f"File not found at {path_or_repo_id}")
-
-    return hf_hub_download(repo_id=path_or_repo_id, filename=filename)
+def _get_file_path(path_or_repo_id: str, filename: str) -> str:
+    """Return *filename* from (a) direct file path, (b) local dir, or (c) HF Hub."""
+    # (a) direct file path
+    if os.path.isfile(path_or_repo_id):
+        return path_or_repo_id
+
+    # (b) local directory – expect <dir>/<filename>
+    if os.path.isdir(path_or_repo_id):
+        file_path = os.path.join(path_or_repo_id, filename)
+        if os.path.isfile(file_path):
+            return file_path
+        raise FileNotFoundError(f"Expected file '{filename}' not found at '{file_path}'")
+
+    # (c) fallback: treat *path_or_repo_id* as HF Hub repo-id
+    return hf_hub_download(repo_id=path_or_repo_id, filename=filename)

🧹 Nitpick comments (3)

src/axolotl/utils/mistral_tokenizer.py (3)
373-400: Avoid needless padding work when no padding requested

Even with padding=False|DO_NOT_PAD, the method still constructs padded tensors
(via pad_sequence) and later trims/extends them.
Early-returning in that branch would be simpler and markedly faster for
large batches.

Consider:
if padding in (False, "do_not_pad", PaddingStrategy.DO_NOT_PAD):
    if return_tensors == "pt":
        return {k: torch.tensor([f[k] for f in features]) for k in features[0]}
    # handle "np" the same way …
This change keeps logic linear and avoids multiple passes over the data.

533-535: Drop redundant .keys() call

Iterating over a dict automatically yields its keys; using .keys() is noisier
and marginally slower.
-    for f in features:
-        for key in f.keys():
+    for f in features:
+        for key in f:
🧰 Tools

🪛 Ruff (0.11.9)

534-534: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

541-549: Replace deprecated np.long with explicit np.int64

np.long is an alias that NumPy plans to deprecate; using the explicit dtype is
future-proof and clearer.
-                    result[k] = v.numpy().astype(np.long)
+                    result[k] = v.numpy().astype(np.int64)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a2509fb and 153a4b9.

📒 Files selected for processing (1)

src/axolotl/utils/mistral_tokenizer.py (1 hunks)

🧰 Additional context used

🧬 Code Graph Analysis (1)

src/axolotl/utils/mistral_tokenizer.py (2)

src/axolotl/convert.py (1)

read (12-14)

src/axolotl/core/chat/messages.py (1)

Tool (59-66)

🪛 Ruff (0.11.9)

src/axolotl/utils/mistral_tokenizer.py

534-534: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

⏰ Context from checks skipped due to timeout of 90000ms (11)

GitHub Check: PyTest (3.11, 2.6.0)
GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
GitHub Check: PyTest (3.11, 2.5.1)
GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
GitHub Check: PyTest (3.11, 2.7.0)
GitHub Check: pre-commit
GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.5.1, 2, true)
GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.6.0, vllm, 2, true)
GitHub Check: test-axolotl-multigpu (126, 12.6.3, 3.11, 2.7.0, 2, true)
GitHub Check: preview

winglian

Good to ship for release, but let's add a follow-on for adding some unit tests.

NanoCode012 · 2025-06-12T21:21:28Z

Added tests for the mistral common tokenizer.

The model itself is just a Mistral model so no need duplicate smoke tests.

NanoCode012 requested review from djsaunde and winglian June 11, 2025 02:51

NanoCode012 commented Jun 11, 2025

View reviewed changes

coderabbitai Bot reviewed Jun 11, 2025

View reviewed changes

NanoCode012 force-pushed the feat/mistral-common branch from d62078e to 23ec193 Compare June 11, 2025 02:56

coderabbitai Bot reviewed Jun 11, 2025

View reviewed changes

Comment thread src/axolotl/utils/mistral_tokenizer.py Outdated

winglian force-pushed the feat/mistral-common branch from e693b73 to db3474e Compare June 11, 2025 18:22

coderabbitai Bot reviewed Jun 11, 2025

View reviewed changes

djsaunde reviewed Jun 11, 2025

View reviewed changes

coderabbitai Bot reviewed Jun 11, 2025

View reviewed changes

Comment thread src/axolotl/utils/mistral_tokenizer.py

winglian reviewed Jun 11, 2025

View reviewed changes

Comment thread src/axolotl/datasets.py

coderabbitai Bot reviewed Jun 11, 2025

View reviewed changes

winglian approved these changes Jun 12, 2025

View reviewed changes

NanoCode012 added 11 commits June 12, 2025 10:30

feat: add mistral-common support

f1b635c

fix: do not access tokenizer prop

67f1ff8

feat: refactor to use wrapper

28ff436

fix: use shutil copyfile instead of symlink

43bdb62

chore: refactor handling of file path

cc98d78

chore: refactor

4f1aec9

feat: add pad function to tokenizer

abae6ce

feat: add magistral

4816612

chore: refactor

11f94ff

chore: refactor

6c494be

fix: move to optional dependency

737c523

NanoCode012 added 18 commits June 12, 2025 10:30

chore: comments

e2fd4e1

feat: auto infer tokenizer_use_mistral_common for magistral

ae4019e

fix: comments

f8002b8

fix: use proper internal function

dab9c1e

chore: refactor

2b45ab6

feat: add convert_ids_to_tokens api

e770727

fix: comments

88308ae

fix: import from comments

22a3b83

fix: get_file_path for local handling

cb2d561

feat: add log on disabling multiprocessing

48f1dde

fix: from_pretrained to load hub link properly

d8a4622

fix: change assert to raise

7e59a82

feat: add magistral to docs

46d2e0d

fix: update readme doc

b994384

fix: wrong function

4316f60

fix: error import message

4b6b915

feat: move to core dependency

8ca0c13

feat: add mistral-common tokenizer test

770f97a

NanoCode012 force-pushed the feat/mistral-common branch from 153a4b9 to 770f97a Compare June 12, 2025 19:01

NanoCode012 added 2 commits June 12, 2025 12:15

fix; lint

237dc3d

feat: add vram usage

f280f5a

winglian merged commit eac4a61 into axolotl-ai-cloud:main Jun 12, 2025
28 of 32 checks passed

This was referenced Jun 13, 2025

Fix: adding magistral fsdp config, fixing not eval with test_datasets, handle mllama attention #2789

Merged

Config doc autogen #2718

Merged

coderabbitai Bot mentioned this pull request Jul 8, 2025

Feat: add devstral model support #2880

Merged

coderabbitai Bot mentioned this pull request Jul 19, 2025

make the initial call to tokenizer.pad not spam the console #2946

Merged

coderabbitai Bot mentioned this pull request Jul 27, 2025

Feat: Add voxtral, magistral small 1.1, and misc gemma3n fixes #2979

Merged

coderabbitai Bot mentioned this pull request Aug 13, 2025

Various fixes for VLMs #3063

Merged

coderabbitai Bot mentioned this pull request Aug 26, 2025

Streaming SFT support #3101

Merged

6 tasks

	if cfg.is_mistral_derived_model and cfg.flash_attention and not cfg.sample_packing:
	tokenizer.padding_side = "left"

Uh oh!

Conversation

NanoCode012 commented Jun 11, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Motivation and Context

How has this been tested?

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Possibly related PRs

Suggested reviewers

Poem

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 11, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 11, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented Jun 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

djsaunde left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

NanoCode012 commented Jun 11, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 11, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)

codecov Bot commented Jun 11, 2025 •

edited

Loading