DeepEnlighten is a lightweight replication study of the DeepSeek-R1-Zero framework. This project investigates the use of pure reinforcement learning (RL) without supervised fine-tuning (SFT) to post-train base models for social reasoning capabilities.
It leverages the following key components:
- RL Framework: verl
- RL Algorithms: REINFORCE++
- RL Dataset: Social IQa
- Base Models: Qwen2.5 (3B), Llama3.2 (3B)
- Math Evaluation: DeepSeek-Math
Social IQa:
- Designed to probe emotional and social intelligence in everyday scenarios.
- Example:
- Q: "Jordan wanted to tell Tracy a secret, so Jordan leaned towards Tracy. Why did Jordan do this?"
- A: "To make sure no one else could hear."
- Dataset preprocessing is implemented in `DeepEnlighten/examples/data_preprocess/social_iqa.py` (a minimal sketch follows this list).
- Raw and processed datasets can be found in `DeepEnlighten/data`. Note that Llama3.2-Instruct and Qwen2.5-Instruct use different instruction-tuning templates, so separate datasets are required for each.
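Below is a minimal sketch of what such preprocessing could look like. The dataset id (`allenai/social_i_qa`), its field names (`context`, `question`, `answerA`–`answerC`, 1-indexed `label`), and the `<think>`/`<answer>` prompt template are assumptions; the actual template in `social_iqa.py` is model-specific and may differ.

```python
# Minimal sketch of Social IQa preprocessing; the actual script is
# DeepEnlighten/examples/data_preprocess/social_iqa.py and may differ.
# Dataset id, field names, and the prompt template below are assumptions.
from datasets import load_dataset

# Hypothetical instruction template; the repo builds model-specific chat
# prompts (Llama3.2-Instruct vs. Qwen2.5-Instruct), hence separate datasets.
PROMPT_TEMPLATE = (
    "{context} {question}\n"
    "A. {answerA}\nB. {answerB}\nC. {answerC}\n"
    "Think step by step inside <think>...</think>, then give your final "
    "choice inside <answer>...</answer>."
)

def build_example(row):
    prompt = PROMPT_TEMPLATE.format(
        context=row["context"], question=row["question"],
        answerA=row["answerA"], answerB=row["answerB"], answerC=row["answerC"],
    )
    gold = "ABC"[int(row["label"]) - 1]  # label is 1-indexed in Social IQa
    return {"prompt": prompt, "ground_truth": gold}

if __name__ == "__main__":
    train = load_dataset("allenai/social_i_qa", split="train")
    train = train.map(build_example, remove_columns=train.column_names)
    train.to_parquet("data/socialiqa_train.parquet")
```

In practice, one processed file per chat template (Llama vs. Qwen) would be produced, matching the note above.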
- Reward modelling is implemented in `DeepEnlighten/verl/utils/reward_score/socialiqa.py` (a minimal sketch follows this list).
- Rules:
  - Format reward: +2 if valid, -2 if invalid.
  - Answer reward: +2 if correct, -2 if incorrect, -3 if invalid.
  - Language consistency reward or other auxiliary rewards: not applied.
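The sketch below illustrates the rule table above. The `<think>`/`<answer>` tag format and the `compute_score(solution_str, ground_truth)` signature are assumptions; the authoritative logic lives in `socialiqa.py`.

```python
import re

# Minimal sketch of the rule-based reward described above. The authoritative
# implementation is DeepEnlighten/verl/utils/reward_score/socialiqa.py; the
# <think>/<answer> tag format and this function signature are assumptions.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def compute_score(solution_str: str, ground_truth: str) -> float:
    """Return format reward + answer reward for a single rollout."""
    match = ANSWER_RE.search(solution_str)

    # Format reward: +2 if the response follows the expected tag structure, else -2.
    has_format = "<think>" in solution_str and "</think>" in solution_str and match is not None
    format_reward = 2.0 if has_format else -2.0

    # Answer reward: +2 if correct, -2 if incorrect, -3 if no answer can be extracted.
    if match is None:
        answer_reward = -3.0
    else:
        predicted = match.group(1).strip().upper()
        answer_reward = 2.0 if predicted == ground_truth.strip().upper() else -2.0

    return format_reward + answer_reward
```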
After configuring your WandB, GPUs, and other settings, execute the training script:
`bash run_rl_trainer_xxx.sh`
For details, refer to:
- DeepEnlighten Training Report
- `analysis` directory: contains log analysis of CoT, language mixing, and the "aha moment".
- `evaluation` directory: contains evaluation results on math benchmarks.
- Social reasoning can generalize to out-of-distribution (OOD) tasks requiring mathematical reasoning.
(Base model = Llama3.2-3B-Instruct, 1000 RL steps; number of samples in parentheses)

| Task | DeepEnlighten-3B | Llama3.2-3B-Instruct |
|---|---|---|
| math-cot-test | 0.4419 (3750) | 0.2672 (3750) |
| cmath-cot-test | 0.5995 (824) | 0.5480 (823) |
| gsm8k-cot-test | 0.7576 (330) | 0.7660 (329) |
- Longer CoT does not consistently appear across different experiments.
- Longer CoT likely emerges only when the task is challenging; on easier tasks the model may resort to memorization rather than genuine reasoning.
- Llama-Instruct as a base model tends to over-think in social reasoning, whereas this paper suggests that Llama-Instruct is the least likely to over-think in math reasoning.
- Further experiments are required to validate these observations.
- While CoT becomes longer and the mean rewards increase, longer CoT does not correlate with higher accuracy.
- This aligns with superficial self-reflection findings from OAT-ZERO.
- Left Figure: Answer accuracy versus token count distribution.
- Right Figure: Regression analysis of accuracy against token count (a minimal sketch of this kind of analysis follows below).
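As a rough illustration of the right-hand figure, the snippet below computes a point-biserial correlation and a linear fit between per-sample token counts and correctness. The arrays are placeholder data; the real analysis lives in the `analysis` directory and may use different tooling.

```python
import numpy as np
from scipy import stats

# Illustrative sketch of the token-count vs. accuracy analysis (right figure).
# `token_counts` and `correct` are placeholder arrays; in practice they would
# be extracted from rollout logs.
token_counts = np.array([412, 388, 530, 295, 610, 350])  # CoT length per sample
correct      = np.array([1,   1,   0,   1,   0,   1])    # 1 = correct answer

# Point-biserial correlation: does a longer CoT predict correctness?
r, p_value = stats.pointbiserialr(correct, token_counts)
print(f"point-biserial r = {r:.3f}, p = {p_value:.3f}")

# Simple linear regression of correctness on token count, as in the plot.
slope, intercept, r_lin, p_lin, stderr = stats.linregress(token_counts, correct)
print(f"slope = {slope:.5f} (close to 0 means longer CoT does not buy accuracy)")
```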
- While language mixing is observed, it is not prevalent (a simple detection sketch follows the table below).
- Example: "购买电影票是娱乐的行为,是一种人性性行为,反映了人 Seekingjoy, pleasure and entertainment's需要。" (roughly: "Buying a movie ticket is an act of entertainment, a human behavior, reflecting people's need for seeking joy, pleasure and entertainment.")
(Base Model = Llama3.2-3B-Instruct)

| Category | Count | Percentage |
|---|---|---|
| Only English | 96674 | 98.23% |
| Only Chinese | 0 | 0.00% |
| Mixed (English & Chinese) | 1727 | 1.75% |
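The categories above could be produced with a simple character-range heuristic such as the sketch below; the repository's actual counting script (in `analysis`) may use a different method.

```python
import re

# Minimal sketch of a language-mixing classifier for model responses.
# The repo's actual analysis may use a different heuristic.
CJK = re.compile(r"[\u4e00-\u9fff]")   # common CJK Unified Ideographs
LATIN = re.compile(r"[A-Za-z]")        # basic Latin letters

def classify_language(text: str) -> str:
    has_zh = CJK.search(text) is not None
    has_en = LATIN.search(text) is not None
    if has_zh and has_en:
        return "mixed"
    if has_zh:
        return "only_chinese"
    if has_en:
        return "only_english"
    return "other"

# Example on the mixed-language response quoted above.
print(classify_language("购买电影票是娱乐的行为 ... Seeking joy, pleasure and entertainment"))  # -> "mixed"
```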
This project builds upon and references several open-source works:
- Logic-RL-Lite: Reproduction of R1-Zero on logic puzzles.
- verl Framework: Reinforcement learning framework.
- DeepSeek-Math: Mathematical reasoning benchmarks.
- Social IQa Dataset: Social reasoning dataset.