Feat: add devstral model support by NanoCode012 · Pull Request #2880 · axolotl-ai-cloud/axolotl

NanoCode012 · 2025-07-08T13:55:35Z

Description

We remove the multiprocessing hack as the MistralTokenizer pickling has been solved mistralai/mistral-common#111 .

Add support for the Devstral models as requested in the linked Issue. The model already worked but this PR fixes some new bugs in the wrapper's pad and adds a lot of missing test for the MistralTokenizer class.

Motivation and Context

How has this been tested?

Ran manually and added tests.

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

New Features
- Added documentation and configuration for fine-tuning the Devstral Small 24B model with Axolotl, including step-by-step instructions and a QLoRA training config.
Documentation
- Updated and clarified installation and setup instructions for Magistral and Devstral examples.
Bug Fixes
- Improved robustness in data padding and batching, ensuring optional fields are handled safely.
Refactor
- Simplified chat template and tokenizer logic, removing unnecessary multiprocessing checks and streamlining chat request creation.
Tests
- Expanded and parameterized tests for Mistral/Devstral tokenizers, including new cases for padding and tool calling.

coderabbitai · 2025-07-08T13:55:43Z

Walkthrough

Support for the Devstral model from MistralAI was added, including documentation, configuration for QLoRA fine-tuning, and comprehensive tests. Several code changes were made to generalize Mistral model support, remove the multiprocessing restriction for tokenization, and improve padding and chat template handling. Related documentation was updated for clarity.

Changes

File(s)	Change Summary
examples/devstral/README.md, examples/devstral/devstral-small-qlora.yml	Added Devstral fine-tuning documentation and a QLoRA configuration YAML for Devstral.
examples/magistral/README.md	Simplified installation and setup instructions; clarified dataset format and tokenizer limitations.
src/axolotl/datasets.py	Removed logic disabling multiprocessing for tokenizers lacking support, allowing multiprocessing for all tokenizers.
src/axolotl/prompt_strategies/chat_template.py, src/axolotl/prompt_tokenizers.py	Removed the `supports_multiprocessing` property from tokenizing strategies and Mistral strategy; improved handling of training fields in chat templates.
src/axolotl/utils/collators/batching.py	Safeguarded deletion of "attention_mask" in padded features to avoid KeyError.
src/axolotl/utils/mistral_tokenizer.py	Refactored chat template application to use `from_openai`; improved padding logic to handle optional fields more defensively; removed creation of default `position_ids`.
tests/prompt_strategies/conftest.py, tests/prompt_strategies/test_chat_templates_mistral.py	Added Devstral tokenizer fixture; refactored and parameterized tests to cover both Magistral and Devstral; added comprehensive tests for padding and tool calling scenarios.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Axolotl
    participant HFMistralTokenizer
    participant ChatCompletionRequest

    User->>Axolotl: Initiate fine-tuning with Devstral config
    Axolotl->>HFMistralTokenizer: Tokenize chat messages (apply_chat_template)
    HFMistralTokenizer->>ChatCompletionRequest: from_openai(messages, tools)
    ChatCompletionRequest-->>HFMistralTokenizer: Chat completion request object
    HFMistralTokenizer-->>Axolotl: Tokenized input
    Axolotl-->>User: Fine-tuning proceeds with tokenized data

Assessment against linked issues

Objective	Addressed	Explanation
Support Devstral by inheriting from Mistral‑Small‑3.1 (#2839)	✅
Add documentation and configuration for Devstral fine-tuning (#2839)	✅
Remove incompatibilities and generalize code for Devstral/Mistral models (#2839)	✅
Add tests for Devstral tokenizer and chat template support (#2839)	✅

Assessment against linked issues: Out-of-scope changes

No out-of-scope changes found.

Possibly related PRs

axolotl-ai-cloud/axolotl#2780: Introduced the supports_multiprocessing property and logic for disabling multiprocessing in tokenizers, which this PR now removes or modifies, making them directly related.
axolotl-ai-cloud/axolotl#2680: Refactored model loader modules; related as this PR adds Devstral examples relying on those loaders.

Suggested reviewers

NanoCode012
SalmanMohammadi

Poem

A rabbit hopped in, with code to compile,
Devstral now joins the Axolotl file!
Multiprocessing unshackled, tests shining bright,
Padding and templates are working just right.
With YAML and docs, the future is clear—
New models to train, let’s all give a cheer!
🐰✨

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d88afa8 and 87e99d6.

📒 Files selected for processing (1)

tests/prompt_strategies/test_chat_templates_mistral.py (6 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: docker-e2e-tests-1st (126, 12.6.3, 3.11, 2.6.0, 1, Dockerfile-uv.jinja)
GitHub Check: docker-e2e-tests-1st (126, 12.6.3, 3.11, 2.6.0, 1)
GitHub Check: preview

🔇 Additional comments (6)

tests/prompt_strategies/test_chat_templates_mistral.py (6)

6-6: Good addition of pytest import for comprehensive testing

The import enables proper use of pytest features like parameterization and exception handling throughout the test file.

12-30: Excellent use of pytest parameterization for multi-tokenizer testing

The parameterized approach elegantly tests both magistral and devstral tokenizers with their respective expected token IDs, ensuring consistency across both implementations.

240-240: Smart use of tuple unpacking for dynamic test expectations

The *assistant_toolcall_ids unpacking allows the same test logic to work with different tokenizers that have different expected token sequences.

Also applies to: 250-250

307-435: Comprehensive pad method testing with excellent coverage

The test thoroughly validates:

Basic padding functionality with input_ids and labels

Optional field handling (attention_mask, position_ids)

Different tensor return types (PyTorch, NumPy)

Edge cases like same-length sequences

Error handling for unsupported fields

The use of pytest.raises() properly addresses previous review feedback about assert False usage.

437-747: Thorough tool calling validation with realistic scenarios

The test suite covers:

Single tool calls with proper response handling

Sequential multiple tool calls

System message integration with tools

Error handling for incomplete tool call sequences

The validation approach using decoded string checks effectively verifies the tokenizer's output format.

741-746: Proper exception handling for tool calling validation

The test correctly catches and validates the expected InvalidMessageStructureException when tool calls and responses don't match, ensuring robust error handling.

✨ Finishing Touches

📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 2

🔭 Outside diff range comments (1)

examples/devstral/devstral-small-qlora.yml (1)
64-65: Incomplete special_tokens configuration

The special_tokens field on line 65 is missing its value. This should either be removed if not needed, or completed with the appropriate special token mappings.

Apply this diff to remove the incomplete field:
 weight_decay: 0.0
-special_tokens:
Or complete it with appropriate special tokens if needed:
 weight_decay: 0.0
-special_tokens:
+special_tokens:
+  pad_token: "<pad>"

🧹 Nitpick comments (1)

examples/devstral/README.md (1)
1-70: Comprehensive documentation for new Devstral model support.

The README provides excellent documentation for the new Devstral model, including detailed installation instructions, configuration examples, and proper limitations disclosure. The structure follows the existing pattern established by other model examples.

However, there are several minor grammatical and formatting issues that should be addressed:
-Devstral Small is a 24B parameter opensource model from MistralAI found on HuggingFace [Devstral-Small-2505](https://huggingface.co/mistralai/Devstral-Small-2505).
+Devstral Small is a 24B parameter open-source model from MistralAI found on HuggingFace [Devstral-Small-2505](https://huggingface.co/mistralai/Devstral-Small-2505).

-The model was fine-tuned ontop of [Mistral-Small-3.1](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503) without the vision layer and has a context of upto 128k tokens.
+The model was fine-tuned on top of [Mistral-Small-3.1](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503) without the vision layer and has a context of up to 128k tokens.

-You need to install from main as Devstral is only on nightly or use our latest [Docker images](https://docs.axolotl.ai/docs/docker.html).
+You need to install from main as Devstral is only on nightly, or use our latest [Docker images](https://docs.axolotl.ai/docs/docker.html).

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1032e22 and d88afa8.

📒 Files selected for processing (10)

examples/devstral/README.md (1 hunks)
examples/devstral/devstral-small-qlora.yml (1 hunks)
examples/magistral/README.md (3 hunks)
src/axolotl/datasets.py (0 hunks)
src/axolotl/prompt_strategies/chat_template.py (1 hunks)
src/axolotl/prompt_tokenizers.py (0 hunks)
src/axolotl/utils/collators/batching.py (1 hunks)
src/axolotl/utils/mistral_tokenizer.py (6 hunks)
tests/prompt_strategies/conftest.py (1 hunks)
tests/prompt_strategies/test_chat_templates_mistral.py (6 hunks)

💤 Files with no reviewable changes (2)

src/axolotl/prompt_tokenizers.py
src/axolotl/datasets.py

🧰 Additional context used

🧠 Learnings (1)

examples/magistral/README.md (1)

Learnt from: NanoCode012
PR: axolotl-ai-cloud/axolotl#2854
File: README.md:73-77
Timestamp: 2025-07-02T02:56:20.788Z
Learning: For Axolotl Docker commands, the `--ipc=host` flag should be included by default to prevent shared memory failures that commonly occur with PyTorch DataLoaders and multiprocessing during machine learning training workflows.

🪛 LanguageTool

examples/magistral/README.md

[grammar] ~24-~24: There might be a mistake here.
Context: ...tion -e '.[flash-attn]' 2. Run the finetuning example: bash axolotl train example...

(QB_NEW_EN_OTHER)

[grammar] ~24-~24: Use proper spacing conventions.
Context: ...tn]' 2. Run the finetuning example: bash axolotl train examples/magistral/magistral-small-qlora.yaml ``` This config uses about 24GB VRAM. Let u...