Feat: add MiMo and Plano#3332
Review skipped: auto incremental reviews are disabled on this repository.
📝 Walkthrough
This pull request adds example configurations and documentation for two new models (MiMo and Plano-Orchestrator) with QLoRA fine-tuning setups, updates the main README to reference these examples, adds a limitations note to the Trinity example, and includes a model revision field in the Trinity configuration.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (2)
examples/plano/plano-4b-qlora.yaml (1)
1-65: Configuration looks good, but consider pinning a model revision.
The QLoRA configuration is well-structured with appropriate parameters. The use of the Cut Cross Entropy plugin aligns with the README guidance.
Consider adding `revision_of_model` for reproducibility, similar to the Trinity and MiMo examples:
     base_model: katanemo/Plano-Orchestrator-4B
    +revision_of_model: <commit_hash>
This ensures consistent behavior across training runs.
examples/mimo/mimo-7b-qlora.yaml (1)
15-17: Consider explicitly specifying chat_template.
The dataset type is `chat_template` but no explicit `chat_template` field is specified (unlike the Plano example, which uses `chat_template: qwen3`). If MiMo requires a specific chat template, it should be explicitly declared for clarity.
If a specific template is needed, add it explicitly:
     datasets:
       - path: fozziethebeat/alpaca_messages_2k_test
         type: chat_template
    +    # chat_template: <template_name>  # Specify if needed
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
README.md
examples/mimo/README.md
examples/mimo/mimo-7b-qlora.yaml
examples/plano/README.md
examples/plano/plano-4b-qlora.yaml
examples/trinity/README.md
examples/trinity/trinity-nano-preview-qlora.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-22T13:23:41.455Z
Learnt from: winglian
Repo: axolotl-ai-cloud/axolotl PR: 3095
File: src/axolotl/cli/merge_lora.py:65-81
Timestamp: 2025-08-22T13:23:41.455Z
Learning: The `lora_on_cpu` configuration in Axolotl is only relevant when loading the full model into memory (standard LoRA merge approach), not when processing individual shards in the memory-efficient approach.
Applied to files:
examples/plano/plano-4b-qlora.yaml
examples/mimo/mimo-7b-qlora.yaml
🪛 LanguageTool
examples/mimo/README.md
[style] ~17-~17: Consider using polite language here.
Context: ... This config uses about 17.2 GiB VRAM. Let us know how it goes. Happy finetuning! 🚀 ### ...
(INSERT_PLEASE)
examples/plano/README.md
[style] ~19-~19: Consider using polite language here.
Context: ...` This config uses about 5.1 GiB VRAM. Let us know how it goes. Happy finetuning! 🚀 ### ...
(INSERT_PLEASE)
🪛 markdownlint-cli2 (0.18.1)
examples/mimo/README.md
23-23: Link text should be descriptive
(MD059, descriptive-link-text)
examples/plano/README.md
30-30: Link text should be descriptive
(MD059, descriptive-link-text)
🔇 Additional comments (7)
README.md (1)
32-32: LGTM! Clear documentation of new model support.
The update properly highlights the addition of Plano-Orchestrator and MiMo examples alongside existing models, with correct links to their respective directories.
examples/mimo/README.md (1)
1-39: Well-documented guide for MiMo fine-tuning.
The README provides clear instructions and context about MiMo's Multiple-Token Prediction approach, with appropriate references to the paper and resources.
examples/mimo/mimo-7b-qlora.yaml (2)
1-67: Configuration looks good with proper revision pinning.
The QLoRA configuration is well-structured with `trust_remote_code` and a pinned revision for reproducibility.
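As a rough illustration of the two fields this comment refers to, the top of such a config might look like the sketch below. The repository id is an assumption based on the model name mentioned in this review, and `<commit_hash>` is a placeholder, not the value used in the PR.

```yaml
# Sketch only: the repository id is assumed; replace <commit_hash> with a real commit.
base_model: XiaomiMiMo/MiMo-7B-RL
revision_of_model: <commit_hash>   # pins the exact Hub snapshot that gets downloaded
trust_remote_code: true            # lets the model's custom modeling code be loaded
```

Pinning the revision matters more than usual when `trust_remote_code` is enabled, since the remote code itself is then fixed to the reviewed commit.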
61-61: MiMo-7B-RL is a standard transformer model that supports flash attention through Axolotl's training framework. The configuration is valid and properly supported.
examples/trinity/trinity-nano-preview-qlora.yaml (1)
3-3: Good practice: pinning model revision for reproducibility.
The specific model revision `2ee94b0` is valid for the `arcee-ai/Trinity-Nano-Preview` model and ensures consistent behavior across runs.
examples/plano/README.md (1)
1-42: Documentation is accurate and well-structured.
The README provides clear fine-tuning instructions for Plano-Orchestrator. The stated 5.1 GiB VRAM usage for 4B + QLoRA is realistic and within expected ranges for this configuration. Industry documentation confirms QLoRA fine-tuning of 4B models typically consumes 3–5 GB VRAM, and the Cut Cross Entropy plugin further reduces loss computation overhead by chunking logit materialization rather than creating full vocab×tokens matrices. No verification concerns identified.
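For orientation, the memory-related settings discussed in this comment usually appear in an Axolotl config along the lines below. This is a hedged sketch, not the contents of `plano-4b-qlora.yaml`: the plugin path is the one used by Axolotl's Cut Cross Entropy integration, and the LoRA values are illustrative assumptions.

```yaml
base_model: katanemo/Plano-Orchestrator-4B

# Cut Cross Entropy: computes the loss without materializing the full logits matrix
plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin

# QLoRA: 4-bit base weights plus trainable low-rank adapters
load_in_4bit: true
adapter: qlora
lora_r: 16                # illustrative value, not taken from the PR
lora_alpha: 32            # illustrative value, not taken from the PR
lora_target_linear: true
```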
examples/plano/plano-4b-qlora.yaml (1)
59-59: Plano-Orchestrator-4B supports flash attention, so the configuration is correct. The difference from the Trinity example (where flash attention is commented as "Not supported") reflects that these models have different attention capabilities—Trinity does not support flash attention while Plano-Orchestrator-4B does.
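In config terms, the difference described here comes down to a single key. A minimal sketch, assuming the Trinity example simply leaves the option commented out:

```yaml
# Plano-Orchestrator-4B example: flash attention enabled
flash_attention: true

# Trinity example: left disabled, marked "Not supported" in that config
# flash_attention: true
```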
📖 Documentation Preview: https://694d1ccdba8931729e083521--resonant-treacle-0fd729.netlify.app
Deployed on Netlify from commit 88e451b
Description
MiMo is an older model by Xiaomi.
Plano is a model built on the Qwen3 and Qwen3MoE architectures.
Motivation and Context
How has this been tested?
Screenshots (if appropriate)
Types of changes
Social Handles (Optional)
Summary by CodeRabbit
New Features
Documentation