A journey of a thousand miles is made one small step at a time.
Miles is an enterprise-facing reinforcement learning framework for large-scale MoE post-training and production workloads, forked from and co-evolving with slime.
Miles keeps slime’s lightweight, modular design, but focuses on:
- New hardware support (e.g., GB300 and beyond)
- Stable, controllable RL for large MoE models
- Production-grade features
- [2025/11] Introducing Miles, born after slime and built toward enterprise RL training (blog).
- Quick Start
- Arguments Walkthrough
- Developer Guide
- Recent Updates
- Roadmap
- Architecture Overview
- FAQ & Acknowledgements
Note: Miles is under active development. Commands and examples may evolve; please check the repo for the latest instructions.
For a comprehensive quick start guide covering environment setup, data preparation, training startup, and key code analysis, please refer to the quick start documentation in the repo.
We also provide examples for some use cases not covered in the quick start guide; please check examples.
Arguments in Miles follow the same three-layer pattern as slime:
- Megatron arguments: all Megatron arguments are exposed unchanged, e.g. `--tensor-model-parallel-size 2`.
- SGLang arguments: all SGLang arguments are exposed with the prefix `--sglang-`, e.g. `--mem-fraction-static` → `--sglang-mem-fraction-static`.
- Miles-specific arguments: please refer to `miles/utils/arguments.py` for the full list.
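As an illustration, the sketch below mixes the three layers in one launcher invocation. The entry point `train.py` and the flag values are hypothetical, and the Miles-specific layer is only indicated by a comment since the real flags live in `miles/utils/arguments.py`:

```python
# Hypothetical launch sketch: one flat CLI carries all three argument layers.
import subprocess

cmd = [
    "python", "train.py",               # entry point is illustrative
    # Layer 1: Megatron arguments, passed through unchanged
    "--tensor-model-parallel-size", "2",
    # Layer 2: SGLang arguments, exposed under the --sglang- prefix
    "--sglang-mem-fraction-static", "0.8",
    # Layer 3: Miles-specific arguments would go here;
    # see miles/utils/arguments.py for what is available.
]
subprocess.run(cmd, check=True)
```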
For more detailed usage, please refer to the documentation and example configs in the repo as they become available.
Miles starts from slime’s proven backbone and adds a series of upgrades for production environments. Recent PRs and changes have also been synced back to slime.
Miles extends slime’s deterministic training and adds infrastructure-level true on-policy support for SGLang + FSDP:
- Keeps the mismatch between training and inference effectively at zero
- Aligns numerical behavior end-to-end between training and deployment
- Uses:
  - FlashAttention-3
  - DeepGEMM
  - Batch-invariant kernels from Thinking Machines Lab
  - `torch.compile` and careful alignment of numeric operations
This makes Miles suitable for high-stakes experiments where repeatability, auditability, and production debugging matter.
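One concrete way to see what zero mismatch means is to compare the per-token log-probs computed by the training backend against those reported by the rollout engine for the same sampled tokens. Below is a minimal sketch of such a check, assuming both tensors are already gathered; the function is a hypothetical helper, not a Miles API:

```python
import torch

def assert_true_on_policy(train_logprobs: torch.Tensor,
                          rollout_logprobs: torch.Tensor) -> None:
    """Check that training and inference agree on per-token log-probs.

    Both tensors hold log-probs for the same sampled tokens, one side
    computed by the training backend and one by the rollout engine.
    With batch-invariant kernels and aligned numerics, the two should
    match exactly, not merely within a tolerance.
    """
    mismatch = (train_logprobs - rollout_logprobs).abs().max().item()
    # A loose torch.allclose would hide real train/infer divergence;
    # true on-policy means the gap is exactly zero.
    assert mismatch == 0.0, f"train/infer log-prob mismatch: {mismatch:.3e}"
```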
To fully utilize precious GPU memory without constant OOM failures, Miles includes:
- Graceful handling of benign OOMs via error propagation
- Memory margins to avoid NCCL-related OOM issues
- Fixes for FSDP excessive memory usage
- Support for move-based and partial offloading
- Host peak memory savings for smoother multi-node training
The goal is to let large MoE jobs run closer to the hardware limit while staying stable.
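As a rough picture of what graceful handling of benign OOMs can look like, the sketch below catches a rollout-side OOM, frees cached memory, and retries with a smaller batch instead of killing the job. `generate_fn` and the halving policy are illustrative assumptions, not Miles’ actual recovery strategy:

```python
import torch

def rollout_with_oom_recovery(batch, generate_fn, max_retries: int = 2):
    """Treat a rollout OOM as benign when possible: free caches, shrink
    the batch, and retry; re-raise only when recovery is hopeless.

    `batch` is assumed to be a sliceable sequence of requests and
    `generate_fn` a stand-in for the rollout engine call.
    """
    for attempt in range(max_retries + 1):
        try:
            return generate_fn(batch)
        except torch.cuda.OutOfMemoryError:
            if attempt == max_retries or len(batch) <= 1:
                raise                       # not benign: propagate upward
            torch.cuda.empty_cache()        # release cached allocator blocks
            batch = batch[: len(batch) // 2]  # back off on batch size
```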
Miles adds speculative training support tailored for RL:
- Performs online SFT on the draft model during RL, instead of freezing it
- Avoids draft policy drift away from the target model
- Achieves 25%+ rollout speedup vs. frozen MTP, especially in later training stages
- Includes:
  - MTP with sequence packing + CP
  - Proper loss masking and edge-case handling
  - LM head / embedding gradient isolation
  - Weight sync flows between Megatron and SGLang
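The core of online draft training can be summarized as a masked cross-entropy of the draft (MTP) head on the tokens the current policy actually sampled, so the draft follows the policy as it moves. Here is a simplified sketch under assumed tensor shapes; the function is illustrative, and the real implementation additionally handles sequence packing, CP, and LM head / embedding gradient isolation:

```python
import torch
import torch.nn.functional as F

def draft_online_sft_loss(draft_logits: torch.Tensor,
                          sampled_tokens: torch.Tensor,
                          loss_mask: torch.Tensor) -> torch.Tensor:
    """Masked cross-entropy of the draft head on freshly sampled tokens.

    draft_logits:   [batch, seq, vocab]  draft (MTP) head predictions
    sampled_tokens: [batch, seq]         tokens sampled from the target policy
    loss_mask:      [batch, seq]         1.0 on response tokens, 0.0 on
                                         prompt / padding positions
    """
    ce = F.cross_entropy(
        draft_logits.flatten(0, 1),   # [batch * seq, vocab]
        sampled_tokens.flatten(),     # [batch * seq]
        reduction="none",
    ).view_as(sampled_tokens)
    # Average only over unmasked (response) tokens.
    return (ce * loss_mask).sum() / loss_mask.sum().clamp(min=1)
```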
Miles actively tracks new hardware and provides usable examples:
- GB300 training support, with more recipes coming
- A formal mathematics (Lean) example with SFT / RL scripts, showcasing Miles in a verifiable environment setting
Additional engineering improvements include:
- Enhanced FSDP training backend
- Option to deploy the rollout subsystem independently outside the main framework
- Better debugging & profiling: more metrics, post-hoc analyzers, and profiler integration
- Gradual refactoring for clarity and maintainability
We are actively evolving Miles toward a production-ready RL engine for large-scale MoE and multimodal workloads. Current roadmap items include:
- Large-scale MoE RL recipes on new hardware (e.g., GB300 and successors)
- Multimodal training support
- Rollout accelerations
  - Compatibility with SGLang spec v2 for improved performance
  - More advanced speculative training schemes (e.g., EAGLE3-style, multi-spec layers)
- Elasticity & fault tolerance
  - More robust handling of GPU / node failures in long-running jobs
- Resource scheduling for async training
  - Balancing training and serving in large-scale asynchronous RL systems
We’ll continue to iterate based on feedback from users across research labs, startups, and enterprise teams.
Miles inherits slime’s core architecture, outlined below.
Module overview:
- training (Megatron): runs the main training loop, reads data from the Data Buffer, and synchronizes parameters to the rollout subsystem after updates.
- rollout (SGLang + router): generates new samples, including rewards / verifier outputs, and writes them back to the Data Buffer.
- data buffer: manages prompt initialization, custom data sources, and rollout generation strategies; serves as the bridge between training and rollout.
This decoupled design lets you:
- Swap in different algorithms / reward functions without touching rollout code
- Customize rollout engines independently from training
- Scale rollouts and training differently depending on hardware and deployment constraints
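In miniature, the contract looks something like the sketch below, where `DataBuffer` is a toy stand-in and `rollout_engine` / `trainer` are hypothetical handles for the SGLang and Megatron subsystems:

```python
from collections import deque

class DataBuffer:
    """Toy stand-in for the Data Buffer, the sole bridge between the
    training and rollout subsystems."""

    def __init__(self):
        self._samples = deque()

    def put(self, samples):
        # Rollout side: append generated samples with rewards attached.
        self._samples.extend(samples)

    def get(self, n):
        # Training side: pop up to n samples for the next optimizer step.
        return [self._samples.popleft()
                for _ in range(min(n, len(self._samples)))]

# One iteration of the decoupled loop (handles are hypothetical):
#   buffer.put(rollout_engine.generate(prompts))  # rollout -> buffer
#   batch = buffer.get(global_batch_size)         # buffer  -> training
#   trainer.train_step(batch)                     # update the policy
#   trainer.sync_params(rollout_engine)           # weights -> rollout
```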
Contributions welcome! We’re especially interested in:
- New hardware backends & tuning
- MoE RL recipes
- Stability / determinism improvements
- Multimodal & speculative training use cases
We recommend using pre-commit to keep style consistent:

```bash
apt install pre-commit -y
pre-commit install
# run pre-commit to ensure code style consistency
pre-commit run --all-files --show-diff-on-failure --color=always
```

- For debugging tips, performance tuning, and internal architecture notes, see the `docs/` and `developer_guide/` folders (coming soon).
- For FAQs, please see `docs/en/get_started/qa.md` (to be added as the project matures).
- Huge thanks to the slime authors and community; Miles would not exist without slime’s design and ecosystem.
- We also acknowledge and rely on the broader LLM infra ecosystem, including SGLang, Megatron-LM, and related tools.
- Miles GitHub: https://github.com/radixark/miles
- slime GitHub: https://github.com/THUDM/slime
We’re excited to see what you build — whether you choose slime, Miles, or both in different parts of your stack. 🚀

