Feat: Add Magistral and mistral-common tokenizer support#2780
Conversation
WalkthroughThis update introduces support for the mistral-common tokenizer, including a HuggingFace-compatible wrapper, configuration options, and specialized prompt strategies. It adds validation logic, documentation, and example configuration for fine-tuning Magistral models. Multiprocessing is conditionally disabled for tokenizers that do not support it, and optional dependencies are updated to include mistral-common. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Config
participant Loader as TokenizerLoader
participant Mistral as HFMistralTokenizer
User->>Config: Set tokenizer_use_mistral_common = True
Config->>Loader: load_tokenizer(cfg)
alt mistral-common enabled
Loader->>Mistral: from_pretrained(cfg.tokenizer_config)
Mistral-->>Loader: HFMistralTokenizer instance
Loader-->>Config: Return mistral-common tokenizer
else default
Loader->>Loader: Load generic tokenizer
Loader-->>Config: Return generic tokenizer
end
sequenceDiagram
participant Trainer
participant Dataset
participant Tokenizer
participant Strategy
Trainer->>Dataset: process()
Dataset->>Tokenizer: supports_multiprocessing?
alt supports multiprocessing
Dataset->>Strategy: Map with multiprocessing
else does not support
Dataset->>Strategy: Map with num_proc=1 (no multiprocessing)
end
Possibly related PRs
Suggested reviewers
Poem
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
|
|
||
| return self._mistral.decode(tokens) | ||
|
|
||
| def pad( |
There was a problem hiding this comment.
Would love eyes on this section. I think I'm duplicating code / not as efficient padding.
There was a problem hiding this comment.
stupid question: can't we call transformers code here instead of writing this ourselves? I admit I'm not super familiar with this area
There was a problem hiding this comment.
I would love to if there's one. Do you know whether there's one?
There was a problem hiding this comment.
transformers.PreTrainedTokenizer.pad exists but I'm not sure if / how much of the functionality overlaps with yours.
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (12)
docs/config.qmd (1)
30-31: Clarify default value & limitations oftokenizer_use_mistral_common.Readers don’t know whether the flag defaults to
false, what side-effects exist (e.g. disables multiprocessing), or that it requires themistral-commondep added in requirements.txt. Consider expanding the inline comment:-# Whether to use mistral-common tokenizer. If set to True, it will use the mistral-common tokenizer. +# Whether to use the `mistral-common` tokenizer (default: false). +# NOTE: • Forces single-process tokenization (mistral tokenizer isn’t picklable) +# • Requires `mistral-common>=1.6.0` +# • Overrides `tokenizer_type` / `tokenizer_use_fast`src/axolotl/utils/schemas/model.py (1)
21-21: Add schema metadata & explicit default for the new field.
tokenizer_use_mistral_commonlacks theField(..., json_schema_extra={...})used for neighbouring attributes, and its implicitNonedefault may complicate downstream boolean logic. Recommend:- tokenizer_use_mistral_common: bool | None = None + tokenizer_use_mistral_common: bool | None = Field( + default=None, + json_schema_extra={ + "description": "Enable support for the mistral-common tokenizer" + }, + )This keeps schema docs consistent and makes intent explicit.
src/axolotl/datasets.py (1)
52-55: Good safeguard, but expose intent via a helper propertyThe guard works, but burying it inline makes future maintenance harder. Consider pushing this logic into
PromptTokenizingStrategyitself:- if not getattr(self.prompt_tokenizer, "supports_multiprocessing", True): - num_proc = 1 + if not self.prompt_tokenizer.supports_multiprocessing: + num_proc = 1…and add the property with a default
Truein the base class. This removes the need for agetattrfallback every time the flag is checked.src/axolotl/utils/schemas/config.py (1)
1260-1302: Validator added 🎉 – minor edge-case & phrasing nits
The field
tokenizer_use_mistral_commonmust exist on the model. Double-check thatModelInputConfigactually defines it; otherwise Pydantic will never run this validator.Message wording:
- “mistral-common is required for mistral models” → might confuse users tuning non-Mistral models but enabling the flag. Suggest something clearer.
- "mistral-common is required for mistral models. Please install it with `pip install mistral-common`." + "The `mistral-common` package is required when `tokenizer_use_mistral_common: true`. " + "Install it with `pip install mistral-common`."Purely cosmetic, but improves UX.
examples/magistral/README.md (1)
40-45: Minor wording / tone polishA couple of tiny style tweaks flagged by LanguageTool:
-We only support the `mistral-common` tokenizer for Supervised Fine-tuning at the moment and for `type: chat_template` only. +Currently, only the `mistral-common` tokenizer is supported for supervised fine-tuning with `type: chat_template`.Not mandatory, just readability improvements.
🧰 Tools
🪛 LanguageTool
[style] ~42-~42: For conciseness, consider replacing this expression with an adverb.
Context: ...ntokenizer for Supervised Fine-tuning at the moment and fortype: chat_template` only. Th...(AT_THE_MOMENT)
src/axolotl/prompt_strategies/chat_template.py (3)
664-666: Clean up or document commented code.Either remove this commented debug log or add a comment explaining why it's disabled. Commented code without context becomes technical debt.
- LOG.debug(f"Content boundaries: {start_idx}, {end_idx}") - # LOG.debug( - # f"Content tokens: {self.tokenizer.convert_ids_to_tokens(full_ids[start_idx:end_idx])}" - # ) + LOG.debug(f"Content boundaries: {start_idx}, {end_idx}") + # TODO: Enable token debugging when needed - currently disabled due to verbosity + # LOG.debug( + # f"Content tokens: {self.tokenizer.convert_ids_to_tokens(full_ids[start_idx:end_idx])}" + # )
862-865: Address the TODO comment.The TODO comment indicates this validation bypass is temporary. Please create a tracking issue or provide a timeline for implementing mistral-specific validation checks.
Would you like me to help design appropriate validation logic for mistral-common tokenizers or create a GitHub issue to track this technical debt?
875-881: Remove commented-out code.This commented method appears to be unused. Please remove it to keep the codebase clean.
- # def find_first_eos_token(self, input_ids, start_idx): - # eos_token_id = self.tokenizer.instruct_tokenizer.tokenizer.eos_id - # for i in range(start_idx, len(input_ids)): - # if input_ids[i] == eos_token_id: - # return i - # return -1 -src/axolotl/utils/mistral_tokenizer.py (4)
51-56: Use contextlib.suppress for cleaner exception handling.Replace the try-except-pass pattern with contextlib.suppress as suggested by static analysis.
+from contextlib import suppress + self._tokenizer_path = _get_file_path(path_or_repo_id, "tekken.json") # Try to load system prompt if available - try: + with suppress(FileNotFoundError): self._system_prompt = self._load_system_prompt( path_or_repo_id=path_or_repo_id ) - except FileNotFoundError: - pass🧰 Tools
🪛 Ruff (0.11.9)
51-56: Use
contextlib.suppress(FileNotFoundError)instead oftry-except-passReplace with
contextlib.suppress(FileNotFoundError)(SIM105)
67-67: Document the protected member access.Add a comment explaining why accessing the protected
_special_token_policyis necessary.is_tekken = isinstance(tokenizer_, Tekkenizer) if is_tekken: + # We need to modify the special token policy to ensure special tokens are preserved + # during decoding. This is a necessary workaround for mistral-common tokenizer behavior. tokenizer_._special_token_policy = SpecialTokenPolicy.KEEP # type: ignore # pylint: disable=protected-access
470-470: Simplify dictionary membership check.Remove unnecessary
.keys()call as suggested by static analysis.- if key not in sequence_fields: + if key not in sequence_fields:Actually, looking at the code more carefully, the current line is already correct. The issue is on line 470 which should be:
- for key in f.keys(): + for key in f:🧰 Tools
🪛 Ruff (0.11.9)
470-470: Use
key in dictinstead ofkey in dict.keys()Remove
.keys()(SIM118)
396-397: Address or remove the TODO comment.The TODO comment about checking if trimming is needed should be addressed or removed if not applicable.
Would you like me to help implement the trimming logic or determine if it's necessary for this use case?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
docs/config.qmd(1 hunks)examples/magistral/README.md(1 hunks)examples/magistral/magistral-small-qlora.yaml(1 hunks)requirements.txt(1 hunks)src/axolotl/datasets.py(1 hunks)src/axolotl/loaders/model.py(1 hunks)src/axolotl/loaders/tokenizer.py(2 hunks)src/axolotl/prompt_strategies/chat_template.py(8 hunks)src/axolotl/prompt_tokenizers.py(1 hunks)src/axolotl/utils/mistral_tokenizer.py(1 hunks)src/axolotl/utils/schemas/config.py(1 hunks)src/axolotl/utils/schemas/model.py(1 hunks)src/axolotl/utils/tokenization.py(1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
src/axolotl/utils/tokenization.py (1)
src/axolotl/utils/mistral_tokenizer.py (1)
decode(182-189)
src/axolotl/loaders/tokenizer.py (2)
src/axolotl/utils/dict.py (1)
DictDefault(6-38)src/axolotl/utils/mistral_tokenizer.py (2)
HFMistralTokenizer(33-489)from_pretrained(120-138)
src/axolotl/prompt_tokenizers.py (1)
src/axolotl/prompt_strategies/chat_template.py (1)
supports_multiprocessing(867-873)
🪛 LanguageTool
examples/magistral/README.md
[style] ~24-~24: Consider using polite language here.
Context: ...gistral/magistral-small-qlora.yaml ``` Let us know how it goes. Happy finetuning! 🚀 ### ...
(INSERT_PLEASE)
[style] ~42-~42: For conciseness, consider replacing this expression with an adverb.
Context: ...ntokenizer for Supervised Fine-tuning at the moment and fortype: chat_template` only. Th...
(AT_THE_MOMENT)
[uncategorized] ~58-~58: Possible missing comma found.
Context: ...l, etc. - Add parity to other tokenizer configs like overriding tokens.
(AI_HYDRA_LEO_MISSING_COMMA)
🪛 Ruff (0.11.9)
src/axolotl/utils/mistral_tokenizer.py
51-56: Use contextlib.suppress(FileNotFoundError) instead of try-except-pass
Replace with contextlib.suppress(FileNotFoundError)
(SIM105)
470-470: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
🔇 Additional comments (5)
requirements.txt (1)
70-72:Details
✅ Verification successful
Check version pin & transitive-dependency compatibility for
mistral-common.
mistral-common==1.6.0is introduced without a companion upper-bound or environment marker.
Given the large set of tokenizer / transformers–adjacent deps already pinned, please verify that:
mistral-common 1.6.0is compatible withtransformers==4.52.3shipped a few lines above.- No conflicting indirect pins are pulled in (e.g.
tokenizers,sentencepiece,cattrs, etc.).Run:
🏁 Script executed:
#!/usr/bin/env bash # Show mistral-common requirements and spot conflicts with existing pins python - <<'PY' import importlib.metadata as md, json, sys, subprocess, re, textwrap, os try: import pkg_resources, pathlib, sysconfig except ImportError: import sys; sys.exit("setuptools not available") req = subprocess.check_output(["python", "-m", "pip", "index", "versions", "mistral-common"]) print(req.decode()) PYLength of output: 347
🏁 Script executed:
#!/usr/bin/env bash # Fetch mistral-common 1.6.0 dependencies from PyPI and list pinned transformer/tokenizer-related deps python - <<'PYTHON' import urllib.request, json data = json.load(urllib.request.urlopen('https://pypi.org/pypi/mistral-common/1.6.0/json')) reqs = data.get('info', {}).get('requires_dist') or [] print("mistral-common 1.6.0 requires_dist:") for r in reqs: print(" ", r) PYTHON echo -e "\nPinned versions in requirements.txt:" grep -R "transformers==" -n requirements.txt || echo " transformers not pinned" grep -R "tokenizers==" -n requirements.txt || echo " tokenizers not pinned" grep -R "sentencepiece==" -n requirements.txt || echo " sentencepiece not pinned" grep -R "cattrs==" -n requirements.txt || echo " cattrs not pinned"Length of output: 3047
🏁 Script executed:
#!/usr/bin/env bash set -e # Clean target directory rm -rf tmp_pkg mkdir tmp_pkg # Install mistral-common into a temporary directory without dependencies pip install mistral-common==1.6.0 --no-deps --target tmp_pkg # Find METADATA file DISTINFO_DIR=$(find tmp_pkg -maxdepth 1 -type d -name "mistral_common-*.dist-info" | head -n 1) METADATA_FILE="$DISTINFO_DIR/METADATA" echo "=== mistral-common 1.6.0 Requires-Dist ===" grep '^Requires-Dist:' "$METADATA_FILE" || echo " (no Requires-Dist entries found)" echo echo "=== Pinned versions in requirements.txt ===" grep -n "transformers==" requirements.txt || echo " transformers not pinned" grep -n "tokenizers==" requirements.txt || echo " tokenizers not pinned" grep -n "sentencepiece==" requirements.txt || echo " sentencepiece not pinned" grep -n "cattrs==" requirements.txt || echo " cattrs not pinned"Length of output: 1665
No conflicts detected for mistral-common 1.6.0
mistral-common 1.6.0 does not require
transformers, so it won’t override or conflict with your pinnedtransformers==4.52.3. Its only overlapping requirement is:•
sentencepiece>=0.2.0(no upper bound pin in your requirements)It does not pull in
tokenizersorcattrs. You can safely keep:# other dependencies -mistral-common==1.6.0 +mistral-common==1.6.0src/axolotl/prompt_tokenizers.py (1)
62-69: Good addition—enables runtime check for pickling support.The default
supports_multiprocessing = Trueand override inMistralStrategycleanly gate multiprocessing usage. No issues spotted.examples/magistral/magistral-small-qlora.yaml (1)
1-64: LGTM!The configuration file is well-structured and follows best practices for QLoRA fine-tuning with appropriate hyperparameters.
src/axolotl/prompt_strategies/chat_template.py (2)
89-89: Good type hint improvement!Making the
conversationparameter type more specific improves code clarity and IDE support.
279-287: Good defensive programming!The added check for
eos_tokenattribute prevents potential AttributeError with tokenizers that don't have this attribute.
| def _load_mistral_common_tokenizer(cfg: DictDefault): | ||
| """Load mistral-common tokenizer""" | ||
| assert ( | ||
| cfg.tokenizer_use_mistral_common | ||
| ), "tokenizer_use_mistral_common must be True" | ||
|
|
||
| from axolotl.utils.mistral_tokenizer import HFMistralTokenizer | ||
|
|
||
| # Load the HF-compatible wrapper around MistralTokenizer | ||
| tokenizer = HFMistralTokenizer.from_pretrained(cfg.tokenizer_config) | ||
|
|
||
| return tokenizer | ||
|
|
||
| if cfg.tokenizer_use_mistral_common: | ||
| return _load_mistral_common_tokenizer(cfg) | ||
|
|
There was a problem hiding this comment.
🛠️ Refactor suggestion
Pass-through of common tokenizer kwargs is lost
_load_mistral_common_tokenizer ignores flags that the generic path handles (e.g. trust_remote_code, revision). If the user sets them in the config they silently do nothing.
- tokenizer = HFMistralTokenizer.from_pretrained(cfg.tokenizer_config)
+ extra_kwargs = {}
+ if cfg.revision_of_model:
+ extra_kwargs["revision"] = cfg.revision_of_model
+ if cfg.trust_remote_code is not None:
+ extra_kwargs["trust_remote_code"] = cfg.trust_remote_code
+
+ tokenizer = HFMistralTokenizer.from_pretrained(
+ cfg.tokenizer_config,
+ **extra_kwargs,
+ )Forwarding these keeps behaviour parity with the standard HF path.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _load_mistral_common_tokenizer(cfg: DictDefault): | |
| """Load mistral-common tokenizer""" | |
| assert ( | |
| cfg.tokenizer_use_mistral_common | |
| ), "tokenizer_use_mistral_common must be True" | |
| from axolotl.utils.mistral_tokenizer import HFMistralTokenizer | |
| # Load the HF-compatible wrapper around MistralTokenizer | |
| tokenizer = HFMistralTokenizer.from_pretrained(cfg.tokenizer_config) | |
| return tokenizer | |
| if cfg.tokenizer_use_mistral_common: | |
| return _load_mistral_common_tokenizer(cfg) | |
| def _load_mistral_common_tokenizer(cfg: DictDefault): | |
| """Load mistral-common tokenizer""" | |
| assert ( | |
| cfg.tokenizer_use_mistral_common | |
| ), "tokenizer_use_mistral_common must be True" | |
| from axolotl.utils.mistral_tokenizer import HFMistralTokenizer | |
| # Load the HF-compatible wrapper around MistralTokenizer | |
| extra_kwargs = {} | |
| if cfg.revision_of_model: | |
| extra_kwargs["revision"] = cfg.revision_of_model | |
| if cfg.trust_remote_code is not None: | |
| extra_kwargs["trust_remote_code"] = cfg.trust_remote_code | |
| tokenizer = HFMistralTokenizer.from_pretrained( | |
| cfg.tokenizer_config, | |
| **extra_kwargs, | |
| ) | |
| return tokenizer | |
| if cfg.tokenizer_use_mistral_common: | |
| return _load_mistral_common_tokenizer(cfg) |
🤖 Prompt for AI Agents
In src/axolotl/loaders/tokenizer.py around lines 124 to 139, the
_load_mistral_common_tokenizer function does not forward common tokenizer
keyword arguments like trust_remote_code and revision from the config, causing
them to be ignored. Modify the function to accept and pass these additional
kwargs to HFMistralTokenizer.from_pretrained by extracting them from cfg or
accepting **kwargs, ensuring consistent behavior with the standard HF tokenizer
loading path.
| # Call the parent's parent __init__ (PromptTokenizingStrategy) to skip ChatTemplateStrategy's validation | ||
| # pylint: disable=non-parent-init-called | ||
| # pylint: disable=super-init-not-called | ||
| PromptTokenizingStrategy.__init__( | ||
| self, prompter, tokenizer, train_on_inputs, sequence_len | ||
| ) |
There was a problem hiding this comment.
Document the validation bypass rationale.
Bypassing the parent's _validate_eot_and_eos_tokens() validation could lead to runtime issues. Please add a detailed comment explaining why this validation doesn't apply to mistral-common tokenizers and what safeguards are in place.
# Call the parent's parent __init__ (PromptTokenizingStrategy) to skip ChatTemplateStrategy's validation
# pylint: disable=non-parent-init-called
# pylint: disable=super-init-not-called
+ # NOTE: We bypass ChatTemplateStrategy's validation because mistral-common tokenizers
+ # handle special tokens differently and don't expose them in the same way as HuggingFace tokenizers.
+ # This is safe because [explain why it's safe or what alternative validation is done].
PromptTokenizingStrategy.__init__(
self, prompter, tokenizer, train_on_inputs, sequence_len
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Call the parent's parent __init__ (PromptTokenizingStrategy) to skip ChatTemplateStrategy's validation | |
| # pylint: disable=non-parent-init-called | |
| # pylint: disable=super-init-not-called | |
| PromptTokenizingStrategy.__init__( | |
| self, prompter, tokenizer, train_on_inputs, sequence_len | |
| ) | |
| # Call the parent's parent __init__ (PromptTokenizingStrategy) to skip ChatTemplateStrategy's validation | |
| # pylint: disable=non-parent-init-called | |
| # pylint: disable=super-init-not-called | |
| # NOTE: We bypass ChatTemplateStrategy's validation because mistral-common tokenizers | |
| # handle special tokens differently and don't expose them in the same way as HuggingFace tokenizers. | |
| # This is safe because [explain why it's safe or what alternative validation is done]. | |
| PromptTokenizingStrategy.__init__( | |
| self, prompter, tokenizer, train_on_inputs, sequence_len | |
| ) |
🤖 Prompt for AI Agents
In src/axolotl/prompt_strategies/chat_template.py around lines 827 to 832, add a
detailed comment above the call to PromptTokenizingStrategy.__init__ explaining
why the _validate_eot_and_eos_tokens() validation is intentionally bypassed for
mistral-common tokenizers. Clarify that this validation does not apply due to
specific tokenizer characteristics and describe any safeguards or alternative
checks in place to prevent runtime issues, ensuring the rationale is clear for
future maintainers.
d62078e to
23ec193
Compare
Codecov ReportAttention: Patch coverage is 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/axolotl/utils/mistral_tokenizer.py (1)
182-189: Simplify decode method to avoid recursion.The recursive call is unnecessary. Wrap the int in a list inline instead.
def decode(self, ids: int | list[int], skip_special_tokens: bool = True) -> str: if not skip_special_tokens: raise NotImplementedError("skip_special_tokens not supported yet") if isinstance(ids, int): - return self.decode([ids]) + ids = [ids] return self._mistral.instruct_tokenizer.tokenizer.decode(ids)
🧹 Nitpick comments (10)
src/axolotl/utils/mistral_tokenizer.py (4)
51-56: Usecontextlib.suppressfor cleaner exception handling.+from contextlib import suppress + # Try to load system prompt if available - try: + with suppress(FileNotFoundError): self._system_prompt = self._load_system_prompt( path_or_repo_id=path_or_repo_id ) - except FileNotFoundError: - pass🧰 Tools
🪛 Ruff (0.11.9)
51-56: Use
contextlib.suppress(FileNotFoundError)instead oftry-except-passReplace with
contextlib.suppress(FileNotFoundError)(SIM105)
396-397: Address the TODO comment about trimming.The TODO comment indicates uncertainty about whether trimming is needed and if it's correct. This should be resolved before merging.
Would you like me to help analyze the trimming logic and determine if it's needed?
470-470: Simplify dictionary membership check.- if key not in sequence_fields: + if key not in sequence_fields:Wait, I need to look at this more carefully. The line is checking
key in f.keys()according to static analysis, but the code shows it differently. Let me re-read...Actually, looking at the code:
for key in f.keys(): if key not in sequence_fields:The static analysis is correct - this should be simplified.
- for key in f.keys(): + for key in f: if key not in sequence_fields:🧰 Tools
🪛 Ruff (0.11.9)
470-470: Use
key in dictinstead ofkey in dict.keys()Remove
.keys()(SIM118)
289-490: Consider refactoring the pad method for better maintainability.This method is quite complex with over 200 lines handling various padding scenarios. Consider breaking it down into smaller helper methods for each tensor type (input_ids, labels, attention_mask, position_ids) to improve readability and maintainability.
Would you like me to help refactor this method into smaller, more focused helper functions?
🧰 Tools
🪛 Ruff (0.11.9)
470-470: Use
key in dictinstead ofkey in dict.keys()Remove
.keys()(SIM118)
examples/magistral/README.md (6)
3-3: Use correct term “open-source”
For accuracy and consistency, change “opensource” to “open-source.”- Magistral Small is a 24B parameter opensource model from… + Magistral Small is a 24B parameter open-source model from…
16-16: Correct casing for PyTorch
Use the official “PyTorch” spelling.- # Ensure you have Pytorch installed (we recommend Pytorch 2.6.0) + # Ensure you have PyTorch installed (we recommend PyTorch 2.6.0)
36-36: Add a polite prompt
Consider adding “Please” for a more friendly tone.- Let us know how it goes. Happy finetuning! 🚀 + Please let us know how it goes. Happy fine-tuning! 🚀
41-41: Hyphenate “fine-tuning”
Maintain consistency by using “fine-tuning” with a hyphen.- You can run a full finetuning by removing the `adapter: qlora`… + You can run a full fine-tuning by removing the `adapter: qlora`…
53-53: Replace “at the moment” with “currently”
For conciseness and clarity, use “currently.”- We only support the `mistral-common` tokenizer for Supervised Fine-tuning at the moment… + We only support the `mistral-common` tokenizer for Supervised Fine-tuning currently…🧰 Tools
🪛 LanguageTool
[style] ~53-~53: For conciseness, consider replacing this expression with an adverb.
Context: ...ntokenizer for Supervised Fine-tuning at the moment and fortype: chat_template` only. Th...(AT_THE_MOMENT)
69-69: Insert missing comma
Add a comma after “configs” for clarity.- Add parity to other tokenizer configs like overriding tokens. + Add parity to other tokenizer configs, like overriding tokens.🧰 Tools
🪛 LanguageTool
[uncategorized] ~69-~69: Possible missing comma found.
Context: ...l, etc. - Add parity to other tokenizer configs like overriding tokens.(AI_HYDRA_LEO_MISSING_COMMA)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
docs/config.qmd(1 hunks)examples/magistral/README.md(1 hunks)examples/magistral/magistral-small-qlora.yaml(1 hunks)setup.py(1 hunks)src/axolotl/datasets.py(1 hunks)src/axolotl/loaders/model.py(1 hunks)src/axolotl/loaders/tokenizer.py(1 hunks)src/axolotl/prompt_strategies/chat_template.py(8 hunks)src/axolotl/prompt_tokenizers.py(1 hunks)src/axolotl/utils/mistral_tokenizer.py(1 hunks)src/axolotl/utils/schemas/config.py(1 hunks)src/axolotl/utils/schemas/model.py(1 hunks)
✅ Files skipped from review due to trivial changes (3)
- setup.py
- docs/config.qmd
- src/axolotl/loaders/model.py
🚧 Files skipped from review as they are similar to previous changes (7)
- src/axolotl/prompt_tokenizers.py
- src/axolotl/datasets.py
- src/axolotl/utils/schemas/model.py
- src/axolotl/loaders/tokenizer.py
- src/axolotl/utils/schemas/config.py
- examples/magistral/magistral-small-qlora.yaml
- src/axolotl/prompt_strategies/chat_template.py
🧰 Additional context used
🪛 Ruff (0.11.9)
src/axolotl/utils/mistral_tokenizer.py
51-56: Use contextlib.suppress(FileNotFoundError) instead of try-except-pass
Replace with contextlib.suppress(FileNotFoundError)
(SIM105)
470-470: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
🪛 LanguageTool
examples/magistral/README.md
[style] ~35-~35: Consider using polite language here.
Context: ...gistral/magistral-small-qlora.yaml ``` Let us know how it goes. Happy finetuning! 🚀 ### ...
(INSERT_PLEASE)
[style] ~53-~53: For conciseness, consider replacing this expression with an adverb.
Context: ...ntokenizer for Supervised Fine-tuning at the moment and fortype: chat_template` only. Th...
(AT_THE_MOMENT)
[uncategorized] ~69-~69: Possible missing comma found.
Context: ...l, etc. - Add parity to other tokenizer configs like overriding tokens.
(AI_HYDRA_LEO_MISSING_COMMA)
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
- GitHub Check: PyTest (3.11, 2.7.0)
- GitHub Check: PyTest (3.11, 2.6.0)
- GitHub Check: PyTest (3.11, 2.5.1)
- GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.5.1, 2, true)
- GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.6.0, vllm, 2, true)
- GitHub Check: test-axolotl-multigpu (126, 12.6.3, 3.11, 2.7.0, 2, true)
🔇 Additional comments (1)
examples/magistral/README.md (1)
21-21: Verify extra dependency group name
The example uses themistralextra, but the PR introduces amistral-commontokenizer package. Please confirm the correct extras group insetup.py.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/axolotl/utils/mistral_tokenizer.py (1)
214-221: Simplify decode method to avoid recursion.The recursive call is unnecessary. Wrap the int in a list inline instead.
def decode(self, ids: int | list[int], skip_special_tokens: bool = True) -> str: if not skip_special_tokens: raise NotImplementedError("skip_special_tokens not supported yet") if isinstance(ids, int): - return self.decode([ids]) + ids = [ids] return self._mistral.instruct_tokenizer.tokenizer.decode(ids)
🧹 Nitpick comments (4)
src/axolotl/utils/mistral_tokenizer.py (4)
69-73: Reliance on private APIs poses maintenance risk.The code accesses private attributes
_special_token_policyand_modefrom the mistral-common library. These could change without notice in future versions.Consider:
- Requesting public APIs from the mistral-common maintainers for these features
- Adding version pinning and tests to detect breaking changes
- Documenting these dependencies clearly
Also applies to: 84-94
154-160: Consider warning about unused kwargs.The method accepts
**kwargsbut silently ignores them. Users might pass unsupported parameters without realizing they're ignored.@classmethod def from_pretrained( cls, path_or_repo_id: str, *, revision: Optional[str] = None, - **kwargs, # pylint: disable=unused-argument + **kwargs, ) -> "HFMistralTokenizer": """ Download a mistral tokenizer from HF Hub and wrap it. """ + if kwargs: + import warnings + warnings.warn(f"Ignoring unsupported arguments: {list(kwargs.keys())}")
424-428: Address the TODO comment about trimming.The TODO comment suggests uncertainty about whether trimming is needed and if it's correct. This should be resolved before merging.
Would you like me to analyze the trimming logic and determine if it's correctly implemented?
498-498: Simplify dictionary key check.- if key not in sequence_fields: + for key in f: + if key not in sequence_fields:🧰 Tools
🪛 Ruff (0.11.9)
498-498: Use
key in dictinstead ofkey in dict.keys()Remove
.keys()(SIM118)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/axolotl/utils/mistral_tokenizer.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
src/axolotl/utils/mistral_tokenizer.py
56-61: Use contextlib.suppress(FileNotFoundError) instead of try-except-pass
Replace with contextlib.suppress(FileNotFoundError)
(SIM105)
498-498: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
⏰ Context from checks skipped due to timeout of 90000ms (12)
- GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.5.1, 2, true)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
- GitHub Check: test-axolotl-multigpu (126, 12.6.3, 3.11, 2.7.0, 2, true)
- GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.6.0, vllm, 2, true)
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: PyTest (3.11, 2.5.1)
- GitHub Check: PyTest (3.11, 2.6.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
- GitHub Check: PyTest (3.11, 2.7.0)
- GitHub Check: pre-commit
- GitHub Check: pre-commit
- GitHub Check: preview
🔇 Additional comments (2)
src/axolotl/utils/mistral_tokenizer.py (2)
317-518: Padding implementation looks correct.The padding logic properly handles:
- Different padding strategies (longest, max_length, do_not_pad)
- Position IDs continuation for right padding
- Proper padding values for each tensor type
- Conversion to requested tensor format
The implementation aligns with HuggingFace's padding behavior.
🧰 Tools
🪛 Ruff (0.11.9)
498-498: Use
key in dictinstead ofkey in dict.keys()Remove
.keys()(SIM118)
26-35:Details
❌ Incorrect review comment
Consider adding validation for the filename parameter.
The function assumes
filenameis safe to use in path operations. Consider validating the filename to prevent potential path traversal issues.
🏁 Script executed:
#!/bin/bash # Check if there are any other usages of _get_file_path that might pass unsafe filenames ast-grep --pattern '_get_file_path($_, $_)'Length of output: 382
No additional filename validation needed
The
_get_file_pathfunction is only ever called with static literals ("tekken.json","SYSTEM_PROMPT.txt") within this module. There’s no user-controlled input flowing into thefilenameparameter, so path traversal isn’t a concern and no extra validation is required.Likely an incorrect or invalid review comment.
e693b73 to
db3474e
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/axolotl/integrations/kd/chat_template.py (2)
192-194: Add type annotation and clarify future-proofing
cfgis introduced in the signature but immediately discarded.
Giving the parameter a type and documenting the rationale will (a) silence type-checkers, (b) avoid the need for a pylint override, and (c) make it clear that branching oncfgis planned for later.-def _get_strategy_cls(self, cfg): # pylint: disable=unused-argument - return ChatTemplateStrategyWithKD +def _get_strategy_cls( + self, cfg: Any # noqa: D401 – kept for future config-based branching +) -> type[ChatTemplateStrategyWithKD]: + # NOTE: For now the KD path always uses ChatTemplateStrategyWithKD. + # The `cfg` argument is accepted so that we can introduce + # conditional selection (e.g. MistralStrategyWithKD) without an + # interface change later. + return ChatTemplateStrategyWithKD
195-203: Makekd_temperatureextraction resilient to non-dict config objects
cfgis treated as a mapping (cfg.get) but elsewhere in the codebase it can be a dataclass orAttrDict.
Falling back togetattrprevents anAttributeErrorin such cases.- if kd_temperature := cfg.get("kd_temperature"): - strategy_params["kd_temperature"] = kd_temperature + kd_temperature = ( + cfg.get("kd_temperature") # type: ignore[attr-defined] + if hasattr(cfg, "get") + else getattr(cfg, "kd_temperature", None) + ) + if kd_temperature is not None: + strategy_params["kd_temperature"] = kd_temperature
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/axolotl/integrations/kd/chat_template.py(1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/axolotl/integrations/kd/chat_template.py (2)
src/axolotl/prompt_strategies/chat_template.py (1)
_get_strategy_cls(897-901)src/axolotl/integrations/base.py (2)
cfg(319-320)cfg(323-324)
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.6.0, vllm, 2, true)
- GitHub Check: test-axolotl-multigpu (126, 12.6.3, 3.11, 2.7.0, 2, true)
- GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.5.1, 2, true)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
- GitHub Check: pre-commit
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: PyTest (3.11, 2.6.0)
- GitHub Check: PyTest (3.11, 2.5.1)
- GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
- GitHub Check: PyTest (3.11, 2.7.0)
- GitHub Check: preview
djsaunde
left a comment
There was a problem hiding this comment.
Can we add a smoke test?
Unit tests would also be nice but that could come later.
| if tokenizer_use_mistral_common: | ||
| try: | ||
| import mistral_common # noqa: F401 # pylint:disable=unused-import | ||
| except ImportError as exception: | ||
| raise ImportError( | ||
| "mistral-common is required for mistral models. Please install it with `pip install mistral-common`." | ||
| ) from exception |
There was a problem hiding this comment.
Do we need this guard? I see you added mistral_common to our requirements.txt so this shouldn't get hit.
There was a problem hiding this comment.
Yes, I moved it to optional dependency like you've previously done.
| # Whether to use the legacy tokenizer setting, defaults to True | ||
| tokenizer_legacy: | ||
| # Whether to use mistral-common tokenizer. If set to True, it will use the mistral-common tokenizer. | ||
| tokenizer_use_mistral_common: |
There was a problem hiding this comment.
I think this is fine for now, but it might be better to auto-detect needing the mistral common tokenizer as this can be a stumbling point for users.
There was a problem hiding this comment.
Yup, I agree here. I'll add a simple name check for now
| from axolotl.utils.schemas.datasets import DatasetConfig | ||
|
|
||
| if TYPE_CHECKING: | ||
| from axolotl.utils.mistral_tokenizer import HFMistralTokenizer |
| path: The path to the tokenizer files. | ||
| """ | ||
| self._mistral = mistral | ||
| self._padding_side = "right" |
There was a problem hiding this comment.
stupid question: doesn't the mistral tokenizer pad from the left?
There was a problem hiding this comment.
I'm not sure about this too. For fine-tuning, it makes sense to be right? Anyways, I think the below would override it either way. Not sure if this block still needs it.
axolotl/src/axolotl/loaders/tokenizer.py
Lines 174 to 175 in 00cda8c
| # Try to load system prompt if available | ||
| try: | ||
| self._system_prompt = self._load_system_prompt( | ||
| path_or_repo_id=path_or_repo_id | ||
| ) | ||
| except FileNotFoundError: | ||
| pass |
There was a problem hiding this comment.
You should initialize self._system_prompt as None here; otherwise, trying to access it later will cause an error.
Or, maybe it's not used anywhere?
I would recommend using contextlib.suppress instead of the try / expect block.
There was a problem hiding this comment.
It's not used anywhere. I'm deleting this block but keeping the utility function for now in case we want to revisit this. I think we don't want to default add system blocks without user's knowledge
| from mistral_common.tokens.tokenizers.mistral import ( | ||
| MistralTokenizer as _MistralTokenizer, | ||
| ) |
| raise NotImplementedError("Revision not supported yet") | ||
|
|
||
| # check if tokenizer_config is a valid local path | ||
| base = _MistralTokenizer.from_file( |
There was a problem hiding this comment.
This class supports .from_hf_hub as well now
There was a problem hiding this comment.
Yep. For now, I'll keep this for loading tekken only. I think we need a function for find_tokenizer_in_folder like they have in from_hf_hub. We need the local tokenizer path to copy over the tokenizer during save_pretrained.
|
|
||
| return self._mistral.decode(tokens) | ||
|
|
||
| def pad( |
There was a problem hiding this comment.
stupid question: can't we call transformers code here instead of writing this ourselves? I admit I'm not super familiar with this area
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/axolotl/utils/mistral_tokenizer.py (2)
43-48: Docstring still references non-existenttokenizerparameter
This was pointed out previously and remains unchanged. Replacetokenizerwithmistralfor accuracy.
256-275: Recursivedecodepersists
The unnecessary recursion forintinputs is still present. A single-line wrap is
simpler and avoids an extra call stack frame.
🧹 Nitpick comments (1)
src/axolotl/utils/mistral_tokenizer.py (1)
550-556: Minor: iterate dict directly
for key in f.keys():is verbose and flagged by Ruff. Iterating over the dict is clearer:- for key in f.keys(): + for key in f:🧰 Tools
🪛 Ruff (0.11.9)
552-552: Use
key in dictinstead ofkey in dict.keys()Remove
.keys()(SIM118)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/axolotl/utils/mistral_tokenizer.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
src/axolotl/utils/mistral_tokenizer.py
552-552: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
- GitHub Check: PyTest (3.11, 2.7.0)
- GitHub Check: PyTest (3.11, 2.5.1)
- GitHub Check: PyTest (3.11, 2.6.0)
- GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.6.0, vllm, 2, true)
- GitHub Check: test-axolotl-multigpu (126, 12.6.3, 3.11, 2.7.0, 2, true)
- GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.5.1, 2, true)
|
TODOs left:
|
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (3)
src/axolotl/utils/mistral_tokenizer.py (3)
42-46: Fix docstring to match actual parameters.The docstring refers to incorrect parameter names and is missing documentation for
tokenizer_path.""" Args: - mistral: The mistral-common tokenizer to wrap. - name_or_path: The name or path to the tokenizer files or the repo id. + mistral: The MistralTokenizer instance to wrap. + name_or_path: The path to the tokenizer files or HuggingFace repository ID. + tokenizer_path: The path to the tokenizer file (tekken.json). """
342-542: Consider leveraging transformers' padding utilities.This is a comprehensive padding implementation, but it might be duplicating functionality already available in transformers. Have you evaluated whether
transformers.PreTrainedTokenizer.padcould be used or adapted instead?🧰 Tools
🪛 Ruff (0.11.9)
523-523: Use
key in dictinstead ofkey in dict.keys()Remove
.keys()(SIM118)
21-31:⚠️ Potential issueCritical: _get_file_path returns directory instead of file path.
When a local directory is provided, the function returns the directory path instead of the full file path, causing errors when the path is used to open or copy files.
def _get_file_path(path_or_repo_id: str, filename: str) -> str: """Get the file path from local or HF Hub""" if os.path.exists(path_or_repo_id): maybe_file_path = os.path.join(path_or_repo_id, filename) if os.path.exists(maybe_file_path): - return path_or_repo_id + return maybe_file_path raise FileNotFoundError(f"File not found at {path_or_repo_id}") return hf_hub_download(repo_id=path_or_repo_id, filename=filename)
🧹 Nitpick comments (2)
src/axolotl/utils/schemas/config.py (1)
1271-1283: Simplify nested if statements and clarify the warning message.The nested if statements can be combined, and the warning message should better explain why explicit setting is preferred.
- if data.get("tokenizer_use_mistral_common") is None: - if any( - "magistral" in name.lower() - for name in [ - data.get("base_model", ""), - data.get("base_model_config", ""), - data.get("tokenizer_config", ""), - ] - ): - LOG.warning( - "tokenizer_use_mistral_common auto inferred to True for Magistral models. Please set it to True explicitly if you want to use mistral-common tokenizer." - ) - data["tokenizer_use_mistral_common"] = True + if data.get("tokenizer_use_mistral_common") is None and any( + "magistral" in name.lower() + for name in [ + data.get("base_model", ""), + data.get("base_model_config", ""), + data.get("tokenizer_config", ""), + ] + ): + LOG.warning( + "Detected Magistral model - auto-enabling mistral-common tokenizer. " + "Set `tokenizer_use_mistral_common: true` explicitly in your config to remove this warning." + ) + data["tokenizer_use_mistral_common"] = True🧰 Tools
🪛 Ruff (0.11.9)
1271-1279: Use a single
ifstatement instead of nestedifstatements(SIM102)
src/axolotl/utils/mistral_tokenizer.py (1)
520-527: Simplify dictionary key iteration.Remove unnecessary
.keys()when iterating over dictionary keys.# Handle non-sequence fields (raise error) sequence_fields = {"input_ids", "labels", "attention_mask", "position_ids"} for f in features: - for key in f.keys(): + for key in f: if key not in sequence_fields: raise NotImplementedError( f"Non-sequence field {key} not handled yet" )🧰 Tools
🪛 Ruff (0.11.9)
523-523: Use
key in dictinstead ofkey in dict.keys()Remove
.keys()(SIM118)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
src/axolotl/datasets.py(1 hunks)src/axolotl/loaders/tokenizer.py(1 hunks)src/axolotl/prompt_strategies/chat_template.py(7 hunks)src/axolotl/utils/mistral_tokenizer.py(1 hunks)src/axolotl/utils/schemas/config.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- src/axolotl/datasets.py
- src/axolotl/loaders/tokenizer.py
- src/axolotl/prompt_strategies/chat_template.py
🧰 Additional context used
🪛 Ruff (0.11.9)
src/axolotl/utils/mistral_tokenizer.py
523-523: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
src/axolotl/utils/schemas/config.py
1271-1279: Use a single if statement instead of nested if statements
(SIM102)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: PyTest (3.11, 2.7.0)
- GitHub Check: PyTest (3.11, 2.6.0)
- GitHub Check: pre-commit
- GitHub Check: PyTest (3.11, 2.5.1)
- GitHub Check: preview
🔇 Additional comments (2)
src/axolotl/utils/schemas/config.py (1)
1300-1328: Well-structured incompatibility checks.The validator clearly identifies and reports incompatible options with helpful error messages.
src/axolotl/utils/mistral_tokenizer.py (1)
183-204: Excellent error handling in save_pretrained.The method includes comprehensive error checking and provides detailed error messages that will help with debugging.
| maybe_tokenizer_path = os.path.join(name_or_path, "tekken.json") | ||
| if os.path.exists(maybe_tokenizer_path): | ||
| base = MistralTokenizer.from_file(maybe_tokenizer_path) | ||
| else: | ||
| base = MistralTokenizer.from_hf_hub(name_or_path, revision=revision) | ||
|
|
||
| return cls( | ||
| base, | ||
| name_or_path=name_or_path, | ||
| tokenizer_path=os.path.join(name_or_path, "tekken.json"), | ||
| ) |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Use _get_file_path helper and handle HF Hub case for tokenizer_path.
The method hardcodes the tekken.json path and assumes it's always local, which won't work correctly when loading from HuggingFace Hub.
- maybe_tokenizer_path = os.path.join(name_or_path, "tekken.json")
- if os.path.exists(maybe_tokenizer_path):
- base = MistralTokenizer.from_file(maybe_tokenizer_path)
- else:
- base = MistralTokenizer.from_hf_hub(name_or_path, revision=revision)
+ try:
+ tokenizer_path = _get_file_path(name_or_path, "tekken.json")
+ base = MistralTokenizer.from_file(tokenizer_path)
+ except FileNotFoundError:
+ # Fall back to HF Hub
+ base = MistralTokenizer.from_hf_hub(name_or_path, revision=revision)
+ # For HF Hub case, we'll need to download the file for save_pretrained
+ tokenizer_path = hf_hub_download(repo_id=name_or_path, filename="tekken.json", revision=revision)
return cls(
base,
name_or_path=name_or_path,
- tokenizer_path=os.path.join(name_or_path, "tekken.json"),
+ tokenizer_path=tokenizer_path,
)Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In src/axolotl/utils/mistral_tokenizer.py lines 159 to 169, replace the
hardcoded "tekken.json" path with the _get_file_path helper function to
correctly resolve the tokenizer_path. Also, ensure that when loading from the
HuggingFace Hub, the tokenizer_path is set appropriately to handle remote files
instead of assuming a local path. This will fix issues with loading the
tokenizer from the hub.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (3)
src/axolotl/utils/mistral_tokenizer.py (3)
295-313: Guard against missingdescription/parameterskeys when buildingToolobjects
function["description"]will raiseKeyErrorif the caller omits it. Better to use.get()with sane defaults, mirroring HF-style behaviour.- function = tool["function"] + function = tool["function"] - tool_calls.append( - Tool( - function=Function( - name=function["name"], - description=function["description"], - # set parameters to empty dict if not provided - parameters=function.get("parameters", {}), - ) - ) - ) + tool_calls.append( + Tool( + function=Function( + name=function["name"], + description=function.get("description", ""), + parameters=function.get("parameters", {}), + ) + ) + )
363-366:token_type_idscheck can be simplified & fasterRuff hint (SIM118): iterate over the dict directly rather than
.keys().-if any("token_type_ids" in f for f in features): +if any("token_type_ids" in f for f in features):You can drop
.keys()entirely:-if any("token_type_ids" in f.keys() for f in features): +if any("token_type_ids" in f for f in features):
534-542: Prefer explicit integer dtype over deprecatednp.long
np.longis an alias that NumPy discourages. Usenp.int64(ornp.int_) for clarity and forward compatibility.- result[k] = v.numpy().astype(np.long) + result[k] = v.numpy().astype(np.int64)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/axolotl/utils/mistral_tokenizer.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
src/axolotl/utils/mistral_tokenizer.py
528-528: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
⏰ Context from checks skipped due to timeout of 90000ms (12)
- GitHub Check: PyTest (3.11, 2.5.1)
- GitHub Check: PyTest (3.11, 2.7.0)
- GitHub Check: PyTest (3.11, 2.6.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: pre-commit
- GitHub Check: test-axolotl-multigpu (126, 12.6.3, 3.11, 2.7.0, 2, true)
- GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.5.1, 2, true)
- GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.6.0, vllm, 2, true)
- GitHub Check: preview
- GitHub Check: pre-commit
| def _get_file_path(path_or_repo_id: str, filename: str) -> str: | ||
| """Get the file path from local or HF Hub""" | ||
| if os.path.exists(path_or_repo_id): | ||
| maybe_file_path = os.path.join(path_or_repo_id, filename) | ||
| if os.path.exists(maybe_file_path): | ||
| return maybe_file_path | ||
|
|
||
| raise FileNotFoundError(f"File not found at {path_or_repo_id}") | ||
|
|
||
| return hf_hub_download(repo_id=path_or_repo_id, filename=filename) | ||
|
|
There was a problem hiding this comment.
🛠️ Refactor suggestion
_get_file_path silently fails for direct file paths & reports confusing error paths
- If the caller passes an absolute file path (
.../tekken.json) this helper will still appendfilename, producing.../tekken.json/tekken.jsonand inevitably fail on the subsequenthf_hub_download. - The
FileNotFoundErrorraised when the file is missing points to the directory (path_or_repo_id) instead of the missing file, making debugging harder.
def _get_file_path(path_or_repo_id: str, filename: str) -> str:
"""Get the file path from local or HF Hub"""
- if os.path.exists(path_or_repo_id):
- maybe_file_path = os.path.join(path_or_repo_id, filename)
- if os.path.exists(maybe_file_path):
- return maybe_file_path
-
- raise FileNotFoundError(f"File not found at {path_or_repo_id}")
-
- return hf_hub_download(repo_id=path_or_repo_id, filename=filename)
+ # ① Caller passed the file itself
+ if os.path.isfile(path_or_repo_id):
+ return path_or_repo_id
+
+ # ② Caller passed a directory – check for <dir>/<filename>
+ if os.path.isdir(path_or_repo_id):
+ file_path = os.path.join(path_or_repo_id, filename)
+ if os.path.isfile(file_path):
+ return file_path
+ raise FileNotFoundError(f"Expected file '{filename}' not found at '{file_path}'")
+
+ # ③ Fallback to HF Hub
+ return hf_hub_download(repo_id=path_or_repo_id, filename=filename)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _get_file_path(path_or_repo_id: str, filename: str) -> str: | |
| """Get the file path from local or HF Hub""" | |
| if os.path.exists(path_or_repo_id): | |
| maybe_file_path = os.path.join(path_or_repo_id, filename) | |
| if os.path.exists(maybe_file_path): | |
| return maybe_file_path | |
| raise FileNotFoundError(f"File not found at {path_or_repo_id}") | |
| return hf_hub_download(repo_id=path_or_repo_id, filename=filename) | |
| def _get_file_path(path_or_repo_id: str, filename: str) -> str: | |
| """Get the file path from local or HF Hub""" | |
| # ① Caller passed the file itself | |
| if os.path.isfile(path_or_repo_id): | |
| return path_or_repo_id | |
| # ② Caller passed a directory – check for <dir>/<filename> | |
| if os.path.isdir(path_or_repo_id): | |
| file_path = os.path.join(path_or_repo_id, filename) | |
| if os.path.isfile(file_path): | |
| return file_path | |
| raise FileNotFoundError(f"Expected file '{filename}' not found at '{file_path}'") | |
| # ③ Fallback to HF Hub | |
| return hf_hub_download(repo_id=path_or_repo_id, filename=filename) |
🤖 Prompt for AI Agents
In src/axolotl/utils/mistral_tokenizer.py around lines 21 to 31, the
_get_file_path function incorrectly handles cases where path_or_repo_id is a
direct file path by appending filename, causing invalid paths and failures.
Also, the FileNotFoundError message references the directory instead of the
missing file. Fix this by first checking if path_or_repo_id is a file path and
returning it directly if so. Update the error message to include the full
missing file path for clearer debugging.
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/axolotl/utils/mistral_tokenizer.py (1)
21-30:⚠️ Potential issue
_get_file_pathmishandles direct file paths & emits misleading errorsPassing an absolute path to
tekken.jsoncurrently produces an invalid path
<file>/tekken.jsonand the function falls back to HF Hub.
Additionally,FileNotFoundErrorpoints to the directory, not the missing file,
which complicates debugging.-def _get_file_path(path_or_repo_id: str, filename: str) -> str: - """Get the file path from local or HF Hub""" - if os.path.exists(path_or_repo_id): - maybe_file_path = os.path.join(path_or_repo_id, filename) - if os.path.exists(maybe_file_path): - return maybe_file_path - - raise FileNotFoundError(f"File not found at {path_or_repo_id}") - - return hf_hub_download(repo_id=path_or_repo_id, filename=filename) +def _get_file_path(path_or_repo_id: str, filename: str) -> str: + """Return *filename* from (a) direct file path, (b) local dir, or (c) HF Hub.""" + # (a) direct file path + if os.path.isfile(path_or_repo_id): + return path_or_repo_id + + # (b) local directory – expect <dir>/<filename> + if os.path.isdir(path_or_repo_id): + file_path = os.path.join(path_or_repo_id, filename) + if os.path.isfile(file_path): + return file_path + raise FileNotFoundError(f"Expected file '{filename}' not found at '{file_path}'") + + # (c) fallback: treat *path_or_repo_id* as HF Hub repo-id + return hf_hub_download(repo_id=path_or_repo_id, filename=filename)
🧹 Nitpick comments (3)
src/axolotl/utils/mistral_tokenizer.py (3)
373-400: Avoid needless padding work when no padding requestedEven with
padding=False|DO_NOT_PAD, the method still constructs padded tensors
(viapad_sequence) and later trims/extends them.
Early-returning in that branch would be simpler and markedly faster for
large batches.Consider:
if padding in (False, "do_not_pad", PaddingStrategy.DO_NOT_PAD): if return_tensors == "pt": return {k: torch.tensor([f[k] for f in features]) for k in features[0]} # handle "np" the same way …This change keeps logic linear and avoids multiple passes over the data.
533-535: Drop redundant.keys()callIterating over a dict automatically yields its keys; using
.keys()is noisier
and marginally slower.- for f in features: - for key in f.keys(): + for f in features: + for key in f:🧰 Tools
🪛 Ruff (0.11.9)
534-534: Use
key in dictinstead ofkey in dict.keys()Remove
.keys()(SIM118)
541-549: Replace deprecatednp.longwith explicitnp.int64
np.longis an alias that NumPy plans to deprecate; using the explicit dtype is
future-proof and clearer.- result[k] = v.numpy().astype(np.long) + result[k] = v.numpy().astype(np.int64)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/axolotl/utils/mistral_tokenizer.py(1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/axolotl/utils/mistral_tokenizer.py (2)
src/axolotl/convert.py (1)
read(12-14)src/axolotl/core/chat/messages.py (1)
Tool(59-66)
🪛 Ruff (0.11.9)
src/axolotl/utils/mistral_tokenizer.py
534-534: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: PyTest (3.11, 2.6.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: PyTest (3.11, 2.5.1)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
- GitHub Check: PyTest (3.11, 2.7.0)
- GitHub Check: pre-commit
- GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.5.1, 2, true)
- GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.6.0, vllm, 2, true)
- GitHub Check: test-axolotl-multigpu (126, 12.6.3, 3.11, 2.7.0, 2, true)
- GitHub Check: preview
winglian
left a comment
There was a problem hiding this comment.
Good to ship for release, but let's add a follow-on for adding some unit tests.
153a4b9 to
770f97a
Compare
|
Added tests for the mistral common tokenizer. The model itself is just a Mistral model so no need duplicate smoke tests. |
Description
Add Magistral and mistral-common package support.
Motivation and Context
How has this been tested?
Preprocess ran locally and manual training runs on cloud.
Screenshots (if appropriate)
Types of changes
Social Handles (Optional)
Summary by CodeRabbit
New Features
Documentation
Bug Fixes
Refactor