Feat: add devstral model support #2880
Walkthrough
Support for the Devstral model from MistralAI was added, including documentation, configuration for QLoRA fine-tuning, and comprehensive tests. Several code changes were made to generalize Mistral model support, remove the multiprocessing restriction for tokenization, and improve padding and chat template handling. Related documentation was updated for clarity.
Sequence Diagram(s)
sequenceDiagram
participant User
participant Axolotl
participant HFMistralTokenizer
participant ChatCompletionRequest
User->>Axolotl: Initiate fine-tuning with Devstral config
Axolotl->>HFMistralTokenizer: Tokenize chat messages (apply_chat_template)
HFMistralTokenizer->>ChatCompletionRequest: from_openai(messages, tools)
ChatCompletionRequest-->>HFMistralTokenizer: Chat completion request object
HFMistralTokenizer-->>Axolotl: Tokenized input
Axolotl-->>User: Fine-tuning proceeds with tokenized data
Assessment against linked issues: Out-of-scope changes
No out-of-scope changes found.
Actionable comments posted: 2
🔭 Outside diff range comments (1)
examples/devstral/devstral-small-qlora.yml (1)
64-65: Incomplete `special_tokens` configuration
The `special_tokens` field on line 65 is missing its value. This should either be removed if not needed, or completed with the appropriate special token mappings.
Apply this diff to remove the incomplete field:
 weight_decay: 0.0
-special_tokens:
Or complete it with appropriate special tokens if needed:
 weight_decay: 0.0
-special_tokens:
+special_tokens:
+  pad_token: "<pad>"
🧹 Nitpick comments (1)
examples/devstral/README.md (1)
1-70: Comprehensive documentation for new Devstral model support.
The README provides excellent documentation for the new Devstral model, including detailed installation instructions, configuration examples, and proper limitations disclosure. The structure follows the existing pattern established by other model examples.
However, there are several minor grammatical and formatting issues that should be addressed:
-Devstral Small is a 24B parameter opensource model from MistralAI found on HuggingFace [Devstral-Small-2505](https://huggingface.co/mistralai/Devstral-Small-2505).
+Devstral Small is a 24B parameter open-source model from MistralAI found on HuggingFace [Devstral-Small-2505](https://huggingface.co/mistralai/Devstral-Small-2505).
-The model was fine-tuned ontop of [Mistral-Small-3.1](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503) without the vision layer and has a context of upto 128k tokens.
+The model was fine-tuned on top of [Mistral-Small-3.1](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503) without the vision layer and has a context of up to 128k tokens.
-You need to install from main as Devstral is only on nightly or use our latest [Docker images](https://docs.axolotl.ai/docs/docker.html).
+You need to install from main as Devstral is only on nightly, or use our latest [Docker images](https://docs.axolotl.ai/docs/docker.html).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- examples/devstral/README.md (1 hunks)
- examples/devstral/devstral-small-qlora.yml (1 hunks)
- examples/magistral/README.md (3 hunks)
- src/axolotl/datasets.py (0 hunks)
- src/axolotl/prompt_strategies/chat_template.py (1 hunks)
- src/axolotl/prompt_tokenizers.py (0 hunks)
- src/axolotl/utils/collators/batching.py (1 hunks)
- src/axolotl/utils/mistral_tokenizer.py (6 hunks)
- tests/prompt_strategies/conftest.py (1 hunks)
- tests/prompt_strategies/test_chat_templates_mistral.py (6 hunks)
💤 Files with no reviewable changes (2)
- src/axolotl/prompt_tokenizers.py
- src/axolotl/datasets.py
🧰 Additional context used
🧠 Learnings (1)
examples/magistral/README.md (1)
Learnt from: NanoCode012
PR: axolotl-ai-cloud/axolotl#2854
File: README.md:73-77
Timestamp: 2025-07-02T02:56:20.788Z
Learning: For Axolotl Docker commands, the `--ipc=host` flag should be included by default to prevent shared memory failures that commonly occur with PyTorch DataLoaders and multiprocessing during machine learning training workflows.
🪛 LanguageTool
examples/magistral/README.md
[grammar] ~24-~24: There might be a mistake here.
Context: ...tion -e '.[flash-attn]' 2. Run the finetuning example: bash axolotl train example...
(QB_NEW_EN_OTHER)
[grammar] ~24-~24: Use proper spacing conventions.
Context: ...tn]' 2. Run the finetuning example: bash axolotl train examples/magistral/magistral-small-qlora.yaml ``` This config uses about 24GB VRAM. Let u...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~39-~39: Use proper spacing conventions.
Context: ...ormats/conversation.html#chat_template). ## Optimization Guides - [Multi-GPU Traini...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~51-~51: Use proper spacing conventions.
Context: ...we do not support overriding tokens yet. ## Related Resources - [MistralAI Magistra...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
examples/devstral/README.md
[grammar] ~1-~1: Use proper spacing conventions.
Context: # Finetune Devstral with Axolotl Devstral Small is a 24B parameter openso...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~3-~3: There might be a mistake here.
Context: ...lotl Devstral Small is a 24B parameter opensource model from MistralAI found on HuggingFa...
(QB_NEW_EN_OTHER)
[grammar] ~3-~3: Combining words like “every day” changes the meaning.
Context: ...pensource model from MistralAI found on HuggingFace [Devstral-Small-2505](https://huggingfa...
(QB_NEW_EN_OTHER_ERROR_IDS_000001)
[grammar] ~3-~3: Use proper spacing conventions.
Context: ...-turn conversations with proper masking. The model was fine-tuned ontop of [Mistr...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~5-~5: Combining words like “every day” changes the meaning.
Context: ...oper masking. The model was fine-tuned ontop of [Mistral-Small-3.1](https://huggingf...
(QB_NEW_EN_OTHER_ERROR_IDS_000001)
[grammar] ~5-~5: Combining words like “every day” changes the meaning.
Context: ...t the vision layer and has a context of upto 128k tokens. ## Getting started 1. In...
(QB_NEW_EN_OTHER_ERROR_IDS_000001)
[grammar] ~5-~5: Use proper spacing conventions.
Context: ...r and has a context of upto 128k tokens. ## Getting started 1. Install Axolotl foll...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~7-~7: Use proper spacing conventions.
Context: ...of upto 128k tokens. ## Getting started 1. Install Axolotl following the [installat...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~9-~9: There might be a mistake here.
Context: ...llation.html). You need to install from main as Devstral is only on nightly or use o...
(QB_NEW_EN_OTHER)
[grammar] ~9-~9: Correctly pair commas and coordinating conjunctions.
Context: ...nstall from main as Devstral is only on nightly or use our latest [Docker images](https...
(QB_NEW_EN_OTHER_ERROR_IDS_000073)
[grammar] ~9-~9: Use proper spacing conventions.
Context: ...tps://docs.axolotl.ai/docs/docker.html). Here is an example of how to install fro...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~11-~11: Use proper spacing conventions.
Context: ...ple of how to install from main for pip: bash # Ensure you have Pytorch installed (Pytorch 2.6.0+) git clone https://github.com/axolotl-ai-cloud/axolotl.git cd axolotl pip3 install packaging==23.2 setuptools==75.8.0 wheel ninja pip3 install --no-build-isolation -e '.[flash-attn]' # Install the latest mistral-common from source pip3 uninstall mistral-common pip3 install git+https://github.com/mistralai/mistral-common.git@039465d 2. Run the finetuning example: ```bash axo...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~27-~27: There might be a mistake here.
Context: ...ral-common.git@039465d 2. Run the finetuning example: bash axolotl train example...
(QB_NEW_EN_OTHER)
[grammar] ~27-~27: Use proper spacing conventions.
Context: ...65d 2. Run the finetuning example: bash axolotl train examples/devstral/devstral-small-qlora.yml ``` This config uses about 21GB VRAM. Let u...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~33-~33: There might be a mistake here.
Context: ...l-qlora.yml ``` This config uses about 21GB VRAM. Let us know how it goes. Happy...
(QB_NEW_EN_OTHER)
[grammar] ~33-~33: Use proper spacing conventions.
Context: ...l ``` This config uses about 21GB VRAM. Let us know how it goes. Happy finetunin...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[style] ~34-~34: Consider using polite language here.
Context: ...``` This config uses about 21GB VRAM. Let us know how it goes. Happy finetuning! 🚀 ### ...
(INSERT_PLEASE)
[grammar] ~35-~35: There might be a mistake here.
Context: ...B VRAM. Let us know how it goes. Happy finetuning! 🚀 ### TIPS - You can run a full fin...
(QB_NEW_EN_OTHER)
[grammar] ~35-~35: Use proper spacing conventions.
Context: ...s know how it goes. Happy finetuning! 🚀 ### TIPS - You can run a full finetuning by...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~37-~37: Use proper spacing conventions.
Context: ... it goes. Happy finetuning! 🚀 ### TIPS - You can run a full finetuning by removin...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~39-~39: There might be a mistake here.
Context: ...ing! 🚀 ### TIPS - You can run a full finetuning by removing the adapter: qlora and `l...
(QB_NEW_EN_OTHER)
[grammar] ~39-~39: Use proper spacing conventions.
Context: ...nd load_in_4bit: true from the config. - Read more on how to load your own datase...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~40-~40: Use proper spacing conventions.
Context: ...s.axolotl.ai/docs/dataset_loading.html). - The dataset format follows the OpenAI Me...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~41-~41: Use proper spacing conventions.
Context: ...ormats/conversation.html#chat_template). ## Optimization Guides - [Multi-GPU Traini...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~43-~43: Use proper spacing conventions.
Context: ...#chat_template). ## Optimization Guides - [Multi-GPU Training](https://docs.axolotl...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~49-~49: Use proper spacing conventions.
Context: ....html#cut-cross-entropy) - Liger Kernel ## Limitations We only support the `mistra...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~51-~51: Use proper spacing conventions.
Context: ...ions.html#liger-kernels) ## Limitations We only support the mistral-common tok...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[style] ~53-~53: For conciseness, consider replacing this expression with an adverb.
Context: ...ntokenizer for Supervised Fine-tuning at the moment and fortype: chat_template` only. In...
(AT_THE_MOMENT)
[grammar] ~53-~53: Use proper spacing conventions.
Context: ...ment and for type: chat_template only. In addition, we do not support overridin...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~55-~55: Use proper spacing conventions.
Context: ...we do not support overriding tokens yet. ## Related Resources - [MistralAI Devstral...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~57-~57: Use proper spacing conventions.
Context: ...riding tokens yet. ## Related Resources - [MistralAI Devstral Blog](https://mistral...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~63-~63: Use proper spacing conventions.
Context: ...](https://axolotl.ai) - Axolotl Discord ## Future Work - Add parity to Preference ...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~66-~66: Use proper spacing conventions.
Context: .../discord.gg/7m9sfhzaf3) ## Future Work - Add parity to Preference Tuning, RL, Mul...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~68-~68: Use proper spacing conventions.
Context: ...Preference Tuning, RL, Multi-modal, etc. - Add parity to other tokenizer configs li...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~69-~69: Use proper spacing conventions.
Context: ...okenizer configs like overriding tokens.
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
🪛 Ruff (0.11.9)
tests/prompt_strategies/test_chat_templates_mistral.py
427-427: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
438-438: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
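The two B011 findings above can be addressed with a pattern like the one below. This is a minimal sketch, not the code from the PR's test file: `parse_turn` and both check functions are hypothetical names introduced only to contrast `assert False` with `raise AssertionError`.

```python
def parse_turn(turn):
    """Hypothetical helper: rejects a turn that has no 'role' key."""
    if "role" not in turn:
        raise ValueError("turn is missing 'role'")
    return turn["role"]


def check_rejects_bad_turn_fragile():
    try:
        parse_turn({"content": "hi"})
        # Flagged by B011: under `python -O` this statement is stripped,
        # so the check would pass even if no error had been raised.
        assert False, "expected ValueError"
    except ValueError:
        pass


def check_rejects_bad_turn_robust():
    try:
        parse_turn({"content": "hi"})
    except ValueError:
        return  # expected path
    # A raise statement is not removed by -O, so the failure is always reported.
    raise AssertionError("expected ValueError")
```

In a pytest suite, `with pytest.raises(ValueError):` expresses the same check more directly and avoids the manual `try`/`except`.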
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: PyTest (3.11, 2.6.0)
- GitHub Check: PyTest (3.11, 2.7.1)
- GitHub Check: pre-commit
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.1)
- GitHub Check: PyTest (3.11, 2.5.1)
- GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
- GitHub Check: preview
- GitHub Check: pre-commit
🔇 Additional comments (11)
tests/prompt_strategies/conftest.py (1)
167-172: LGTM! Consistent test fixture implementation.
The new `devstral_tokenizer` fixture follows the same pattern as the existing `magistral_tokenizer` fixture and correctly references the Devstral model specified in the PR objectives.
src/axolotl/utils/collators/batching.py (1)
111-112: Excellent defensive programming improvement.
The refined condition prevents potential `KeyError` exceptions by only deleting the `attention_mask` key when it wasn't originally present but was added during padding. This aligns with the PR's goal of fixing padding-related bugs.
examples/magistral/README.md (3)
21-21: Good simplification of installation requirements.
Removing the `mistral` extra from the pip install command aligns with the PR's simplification of Mistral tokenizer handling.
39-39: Minor grammar improvement.
The change from "The dataset format is" to "The dataset format follows" improves clarity.
51-51: Correctly reflects removal of multiprocessing limitations.
Removing the mention of tokenizer multiprocessing limitations aligns with the PR's removal of the multiprocessing workaround for MistralTokenizer pickling issues.
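Returning to the `batching.py` comment (111-112) above, the guarded deletion it describes can be sketched as follows. This is an illustration under stated assumptions, not the collator code from the PR: `pad_with_mask` and `collate` are hypothetical names, and the padder is assumed to always return an `attention_mask`.

```python
def pad_with_mask(features, pad_token_id=0):
    """Hypothetical padder that always returns an attention_mask."""
    max_len = max(len(f["input_ids"]) for f in features)
    input_ids, masks = [], []
    for f in features:
        n = len(f["input_ids"])
        extra = max_len - n
        input_ids.append(list(f["input_ids"]) + [pad_token_id] * extra)
        # Reuse the caller's mask if given, otherwise mark all real tokens.
        mask = list(f.get("attention_mask", [1] * n))
        masks.append(mask + [0] * extra)
    return {"input_ids": input_ids, "attention_mask": masks}


def collate(features):
    # Record whether the caller supplied a mask before padding runs.
    had_mask = all("attention_mask" in f for f in features)
    batch = pad_with_mask(features)
    # Delete the mask only when padding introduced it. A mask supplied by
    # the caller is kept, and a missing key is never deleted (no KeyError).
    if not had_mask and "attention_mask" in batch:
        del batch["attention_mask"]
    return batch
```

With inputs that carry no mask, `collate` returns only `input_ids`; with inputs that do carry one, the padded mask is preserved.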
src/axolotl/prompt_strategies/chat_template.py (1)
684-691: Good improvement to avoid None value pollution.
The change to only add `training` and `training_detail` fields when they're not None prevents unnecessary key-value pairs with None values in the turn dictionary. This is cleaner and more efficient than the previous approach.
tests/prompt_strategies/test_chat_templates_mistral.py (2)
12-306: Well-structured parameterized test implementation
The conversion to pytest parameterization is excellent, allowing comprehensive testing of multiple tokenizer variants. The test coverage for chat templates, system prompts, and tool usage is thorough.
443-754: Comprehensive tool calling test coverage
Excellent test coverage for tool calling functionality, including single/multiple tool calls, system messages, and error handling for incomplete tool responses.
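The `chat_template.py` comment (684-691) above describes adding `training` and `training_detail` to a turn only when they are set. A minimal sketch of that pattern follows; `build_turn` is a hypothetical name, and the real function in `chat_template.py` may be structured differently.

```python
def build_turn(message):
    """Copy role and content, adding optional fields only when they are set."""
    turn = {"role": message["role"], "content": message.get("content")}
    for key in ("training", "training_detail"):
        value = message.get(key)
        # Compare against None rather than truthiness so that an explicit
        # `training: False` is still carried through.
        if value is not None:
            turn[key] = value
    return turn
```

The `is not None` comparison matters here: a truthiness test would also drop `False`, which is a meaningful value for a per-turn training flag.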
src/axolotl/utils/mistral_tokenizer.py (3)
274-276: Good simplification of chat completion request creation
Using `ChatCompletionRequest.from_openai` is cleaner and delegates proper validation to the mistral-common library.
340-461: Robust handling of optional fields in pad method
The refactored pad method now correctly handles optional fields (`attention_mask` and `position_ids`) by only processing and including them when present in the input features. This defensive approach prevents errors when these fields are missing.
477-477: Good practice: explicit numpy dtype
Using `np.int64` instead of `np.long` is more explicit and portable across different platforms.
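The optional-field handling described for the `pad` method (340-461) can be illustrated with plain lists. This is a simplified sketch, assuming right-padding and a pad value of 0 for the optional fields; `pad_features` is a hypothetical name and does not reproduce the wrapper's actual signature or its tensor handling.

```python
def pad_features(features, pad_token_id=0):
    """Right-pad each feature dict to the longest input_ids in the batch.

    Optional keys are padded and returned only if every input carried them.
    """
    max_len = max(len(f["input_ids"]) for f in features)
    # Pad value per key; attention_mask and position_ids are optional.
    pad_values = {"input_ids": pad_token_id, "attention_mask": 0, "position_ids": 0}
    present = [k for k in pad_values if all(k in f for f in features)]
    batch = {k: [] for k in present}
    for f in features:
        extra = max_len - len(f["input_ids"])
        for k in present:
            batch[k].append(list(f[k]) + [pad_values[k]] * extra)
    return batch
```

Because absent keys are never looked up, a batch without `attention_mask` or `position_ids` pads cleanly instead of raising.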
Description
Closes #2839
We remove the multiprocessing hack, as the MistralTokenizer pickling issue has been solved in mistralai/mistral-common#111.
Add support for the Devstral models as requested in the linked issue. The model already worked, but this PR fixes some new bugs in the wrapper's `pad` and adds many missing tests for the MistralTokenizer class.
How has this been tested?
Ran manually and added tests.