DeepEnlighten is a lightweight replication study of the DeepSeek-R1-Zero framework. This project investigates the use of pure reinforcement learning (RL) without supervised fine-tuning (SFT) to post-train base models for social reasoning capabilities.
It leverages the following key components:
- RL Framework: verl
- RL Algorithms: REINFORCE++
- RL Dataset: Social IQa
- Base Models: Qwen2.5 (3B), Llama3.2 (3B)
- Math Evaluation: DeepSeek-Math
Social IQa:
- Designed to probe emotional and social intelligence in everyday scenarios.
- Example:
- Q: "Jordan wanted to tell Tracy a secret, so Jordan leaned towards Tracy. Why did Jordan do this?"
- A: "To make sure no one else could hear."
- Dataset preprocessing is implemented in `DeepEnlighten/examples/data_preprocess/social_iqa.py` (a minimal sketch follows this list).
- Raw and processed datasets can be found in `DeepEnlighten/data`. Note that Llama3.2-Instruct and Qwen2.5-Instruct use different instruction-tuning templates, so separate datasets are required for each.
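Below is a minimal sketch of what such preprocessing could look like. The dataset id (`allenai/social_i_qa`), its field names (`context`, `question`, `answerA`–`answerC`, 1-indexed `label`), and the `<think>`/`<answer>` prompt template are assumptions; the actual template in `social_iqa.py` is model-specific and may differ.

```python
# Minimal sketch of Social IQa preprocessing; the actual script is
# DeepEnlighten/examples/data_preprocess/social_iqa.py and may differ.
# Dataset id, field names, and the prompt template below are assumptions.
from datasets import load_dataset

# Hypothetical instruction template; the repo builds model-specific chat
# prompts (Llama3.2-Instruct vs. Qwen2.5-Instruct), hence separate datasets.
PROMPT_TEMPLATE = (
    "{context} {question}\n"
    "A. {answerA}\nB. {answerB}\nC. {answerC}\n"
    "Think step by step inside <think>...</think>, then give your final "
    "choice inside <answer>...</answer>."
)

def build_example(row):
    prompt = PROMPT_TEMPLATE.format(
        context=row["context"], question=row["question"],
        answerA=row["answerA"], answerB=row["answerB"], answerC=row["answerC"],
    )
    gold = "ABC"[int(row["label"]) - 1]  # label is 1-indexed in Social IQa
    return {"prompt": prompt, "ground_truth": gold}

if __name__ == "__main__":
    train = load_dataset("allenai/social_i_qa", split="train")
    train = train.map(build_example, remove_columns=train.column_names)
    train.to_parquet("data/socialiqa_train.parquet")
```

In practice, one processed file per chat template (Llama vs. Qwen) would be produced, matching the note above.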
- Reward modelling is implemented in `DeepEnlighten/verl/utils/reward_score/socialiqa.py` (a minimal sketch follows this list).
- Rules:
  - Format reward: +2 if valid, -2 if invalid.
  - Answer reward: +2 if correct, -2 if incorrect, -3 if invalid.
  - Language consistency reward or other auxiliary rewards: not applied.
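The sketch below illustrates the rule table above. The `<think>`/`<answer>` tag format and the `compute_score(solution_str, ground_truth)` signature are assumptions; the authoritative logic lives in `socialiqa.py`.

```python
import re

# Minimal sketch of the rule-based reward described above. The authoritative
# implementation is DeepEnlighten/verl/utils/reward_score/socialiqa.py; the
# <think>/<answer> tag format and this function signature are assumptions.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def compute_score(solution_str: str, ground_truth: str) -> float:
    """Return format reward + answer reward for a single rollout."""
    match = ANSWER_RE.search(solution_str)

    # Format reward: +2 if the response follows the expected tag structure, else -2.
    has_format = "<think>" in solution_str and "</think>" in solution_str and match is not None
    format_reward = 2.0 if has_format else -2.0

    # Answer reward: +2 if correct, -2 if incorrect, -3 if no answer can be extracted.
    if match is None:
        answer_reward = -3.0
    else:
        predicted = match.group(1).strip().upper()
        answer_reward = 2.0 if predicted == ground_truth.strip().upper() else -2.0

    return format_reward + answer_reward
```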
After configuring your WandB, GPUs, and other settings, execute the training script:
`bash run_rl_trainer_xxx.sh`
For details, refer to:
- DeepEnlighten Training Report
- `analysis` directory: contains log analysis of CoT, language mixing, and the "aha moment".
- `evaluation` directory: contains evaluation results on math benchmarks.
- Social reasoning can generalize to out-of-distribution (OOD) tasks requiring mathematical reasoning.
(Base model = Llama3.2-3B-Instruct, 1000 RL steps; number of samples in parentheses)

| Task | DeepEnlighten-3B | Llama3.2-3B-Instruct |
|---|---|---|
| math-cot-test | 0.4419 (3750) | 0.2672 (3750) |
| cmath-cot-test | 0.5995 (824) | 0.5480 (823) |
| gsm8k-cot-test | 0.7576 (330) | 0.7660 (329) |
- Longer CoT does not consistently appear across different experiments.
- Longer CoT likely emerges only when the task is challenging; on easier tasks the model may resort to memorization rather than genuine reasoning.
- Llama-Instruct as a base model tends to over-think in social reasoning, whereas this paper suggests that Llama-Instruct is the least likely to over-think in math reasoning.
- Further experiments are required to validate these observations.
- While CoT becomes longer and the mean rewards increase, longer CoT does not correlate with higher accuracy.
- This aligns with superficial self-reflection findings from OAT-ZERO.
- Left Figure: Answer accuracy versus token count distribution.
- Right Figure: Regression analysis of accuracy against token count (a minimal sketch of this kind of analysis follows below).
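As a rough illustration of the right-hand figure, the snippet below computes a point-biserial correlation and a linear fit between per-sample token counts and correctness. The arrays are placeholder data; the real analysis lives in the `analysis` directory and may use different tooling.

```python
import numpy as np
from scipy import stats

# Illustrative sketch of the token-count vs. accuracy analysis (right figure).
# `token_counts` and `correct` are placeholder arrays; in practice they would
# be extracted from rollout logs.
token_counts = np.array([412, 388, 530, 295, 610, 350])  # CoT length per sample
correct      = np.array([1,   1,   0,   1,   0,   1])    # 1 = correct answer

# Point-biserial correlation: does a longer CoT predict correctness?
r, p_value = stats.pointbiserialr(correct, token_counts)
print(f"point-biserial r = {r:.3f}, p = {p_value:.3f}")

# Simple linear regression of correctness on token count, as in the plot.
slope, intercept, r_lin, p_lin, stderr = stats.linregress(token_counts, correct)
print(f"slope = {slope:.5f} (close to 0 means longer CoT does not buy accuracy)")
```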
- While language mixing is observed, it is not prevalent (a simple detection sketch follows the table below).
- Example: "购买电影票是娱乐的行为,是一种人性性行为,反映了人 Seekingjoy, pleasure and entertainment's需要。" (roughly: "Buying a movie ticket is an act of entertainment, a human behavior, reflecting people's need for seeking joy, pleasure and entertainment.")
(Base Model = Llama3.2-3B-Instruct)

| Category | Count | Percentage |
|---|---|---|
| Only English | 96674 | 98.23% |
| Only Chinese | 0 | 0.00% |
| Mixed (English & Chinese) | 1727 | 1.75% |
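The categories above could be produced with a simple character-range heuristic such as the sketch below; the repository's actual counting script (in `analysis`) may use a different method.

```python
import re

# Minimal sketch of a language-mixing classifier for model responses.
# The repo's actual analysis may use a different heuristic.
CJK = re.compile(r"[\u4e00-\u9fff]")   # common CJK Unified Ideographs
LATIN = re.compile(r"[A-Za-z]")        # basic Latin letters

def classify_language(text: str) -> str:
    has_zh = CJK.search(text) is not None
    has_en = LATIN.search(text) is not None
    if has_zh and has_en:
        return "mixed"
    if has_zh:
        return "only_chinese"
    if has_en:
        return "only_english"
    return "other"

# Example on the mixed-language response quoted above.
print(classify_language("购买电影票是娱乐的行为 ... Seeking joy, pleasure and entertainment"))  # -> "mixed"
```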
This project builds upon and references several open-source works:
- Logic-RL-Lite: Reproduction of R1-Zero on logic puzzles.
- verl Framework: Reinforcement learning framework.
- DeepSeek-Math: Mathematical reasoning benchmarks.
- Social IQa Dataset: Social reasoning dataset.