🤗 Hugging Face | 📑 Paper
AutoThink is a reinforcement learning framework designed to equip R1-style language models with adaptive reasoning capabilities. Instead of always thinking or never thinking, the model learns when to engage in explicit reasoning, balancing performance and efficiency.
This repository implements AutoThink, as described in our paper:
Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL
Anonymous Authors (Under Review)
- 🧩 Minimal prompting with an ellipsis (`<think>\n...\n`) to activate stochastic thinking.
- 🎯 Multi-stage RL to stabilize, reinforce, and prune reasoning behavior.
- ⚙️ Integrated with the `verl` framework.
- 📊 Benchmarked on five mathematical reasoning datasets: MATH, Minerva, Olympiad, AIME24, AMC23.
 
Please clone the official DeepScaleR repository and follow its setup instructions:
Then, replace the following three folders in the original repo with ours:
```bash
cp -r code-release/verl       deepscaler/
cp -r code-release/scripts    deepscaler/
cp -r code-release/deepscaler deepscaler/
```

Install the environment:
```bash
# Recommend Python 3.10.
cd deepscaler
pip install -e ./verl
pip install -e .
```
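If the editable installs succeeded, both packages should be importable from the same environment. A quick check, assuming the package names match the `verl` and `deepscaler` folders installed above:

```python
# Sanity check: both editable installs should be importable.
# Package names are assumed to match the folders installed above.
import verl
import deepscaler

print(verl.__file__)
print(deepscaler.__file__)
```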
The raw training data is located in `deepscaler/data/[train|test]`, along with preprocessing scripts. To convert the raw data into Parquet files for training, run:

```bash
# Output parquet files in data/*.parquet.
python scripts/data/deepscaler_dataset.py
```
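To confirm the conversion, you can inspect one of the generated Parquet files. A minimal sketch, assuming `pandas` with a Parquet engine (e.g. `pyarrow`) is available and at least one file was written to `data/`:

```python
import glob

import pandas as pd

# List the Parquet files produced by the preprocessing script.
files = sorted(glob.glob("data/*.parquet"))
print(files)

# Peek at the first split to verify it loads and to see its columns.
df = pd.read_parquet(files[0])
print(df.columns.tolist())
print(df.head(2))
```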
You can control the model's reasoning behavior by modifying the `chat_template` field in `tokenizer_config.json`. Update the value with one of the following:

- Standard Prompt (default for Distill-R1, no changes needed):

  `"<|Assistant|><think>\n"`

- No-Thinking Prompt (forces minimal reasoning):

  `"<|Assistant|><think>\nOkay, I think I have finished thinking.\n</think>\n\n"`

- Ellipsis Prompt (adaptive reasoning mode):

  `"<|Assistant|><think>\n...\n"`

These prompts enable different reasoning behaviors.
Before AutoThink training, please replace the default `chat_template` with the Ellipsis Prompt and keep the inference prompt consistent.
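To verify that the template edit took effect, you can render a prompt with the modified tokenizer. A minimal sketch, assuming the `transformers` library and a local Distill-R1 checkpoint whose `tokenizer_config.json` you have already edited (the path below is a placeholder):

```python
from transformers import AutoTokenizer

# Placeholder path: point this at the checkpoint whose tokenizer_config.json you edited.
tok = AutoTokenizer.from_pretrained("path/to/DeepSeek-R1-Distill-Qwen-1.5B")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is 17 * 24?"}],
    tokenize=False,
    add_generation_prompt=True,
)

# With the Ellipsis Prompt in place, the rendered text should end with
# "<|Assistant|><think>\n...\n"; generation then starts right after the ellipsis.
print(repr(prompt[-40:]))
```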
AutoThink training proceeds in three stages with different reward designs:
```bash
# Stage 1: Stabilize dual-mode reasoning
bash scripts/train_stage1.sh

# Stage 2: Reinforce accurate behavior
bash scripts/train_stage2.sh

# Stage 3: Prune redundant reasoning
bash scripts/train_stage3.sh
```

Make sure to configure your model paths and data in `scripts/train_*.sh`.
After training, evaluate the model using:
```bash
bash scripts/eval/eval_model_1.5b.sh
```

AutoThink achieves favorable efficiency–accuracy trade-offs and exhibits two inference modes: depending on the query, it either produces an explicit reasoning trace before answering or answers directly with minimal thinking.
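For a quick qualitative look at the two modes outside the evaluation scripts, you can sample from a trained checkpoint directly. A rough sketch, assuming `transformers` and a GPU, with the checkpoint path as a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path to an AutoThink-trained checkpoint.
path = "path/to/autothink-1.5b"
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map="auto")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Compute 3^4 - 5."}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.6)

# Depending on the sampled mode, the continuation either closes </think> almost
# immediately (no-thinking) or contains a full reasoning trace before the answer.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```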
We build on and reference open-source projects, notably DeepScaleR and the verl framework, and we thank them for their contributions to the LLM reasoning open-source community.



