Pure RL to post-train base models for social reasoning capabilities. Lightweight replication of DeepSeek-R1-Zero with Social IQa dataset.

DeepEnlighten: Generalization from EQ to IQ

DeepEnlighten is a lightweight replication study of the DeepSeek-R1-Zero framework. This project investigates the use of pure reinforcement learning (RL) without supervised fine-tuning (SFT) to post-train base models for social reasoning capabilities.

It leverages the following key components:

  1. RL Framework: verl
  2. RL Algorithms: REINFORCE++
  3. RL Dataset: Social IQa
  4. Base Models: Qwen2.5 (3B), Llama3.2 (3B)
  5. Math Evaluation: DeepSeek-Math

Dataset

Social IQa:

  • Designed to probe emotional and social intelligence in everyday scenarios.
  • Example:
    • Q: "Jordan wanted to tell Tracy a secret, so Jordan leaned towards Tracy. Why did Jordan do this?"
    • A: "To make sure no one else could hear."
  • Dataset preprocessing is implemented in DeepEnlighten/examples/data_preprocess/social_iqa.py.
  • Raw and processed datasets can be found in DeepEnlighten/data. Note that Llama3.2-Instruct and Qwen2.5-Instruct use different instruction tuning templates, so separate datasets are required for each.
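
As a rough sketch of what the preprocessing step does, a Social IQa example (fields `context`, `question`, `answerA`/`answerB`/`answerC`, `label`) can be turned into a single multiple-choice prompt plus a gold letter. The function name, prompt wording, and `<think>`/`<answer>` template below are illustrative assumptions, not the actual code in DeepEnlighten/examples/data_preprocess/social_iqa.py:

```python
# Hypothetical sketch of Social IQa preprocessing; the real script is
# DeepEnlighten/examples/data_preprocess/social_iqa.py.
def build_prompt(example: dict) -> dict:
    # Social IQa fields: context, question, answerA/B/C, label ("1"-"3").
    choices = [example["answerA"], example["answerB"], example["answerC"]]
    options = "\n".join(f"({c}) {a}" for c, a in zip("ABC", choices))
    prompt = (
        f"{example['context']} {example['question']}\n{options}\n"
        "Think step by step inside <think>...</think>, then give the "
        "letter of your answer inside <answer>...</answer>."
    )
    answer = "ABC"[int(example["label"]) - 1]  # "1" -> "A", "2" -> "B", ...
    return {"prompt": prompt, "answer": answer}

sample = {
    "context": "Jordan wanted to tell Tracy a secret, so Jordan leaned towards Tracy.",
    "question": "Why did Jordan do this?",
    "answerA": "To make sure no one else could hear.",
    "answerB": "To surprise Tracy.",
    "answerC": "To walk away.",
    "label": "1",
}
print(build_prompt(sample)["answer"])  # A
```

Separate variants of this conversion would then be wrapped in the Llama3.2-Instruct and Qwen2.5-Instruct chat templates, which is why the repo keeps per-model datasets.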

Rule-Based Rewards

  • Reward modelling is implemented in DeepEnlighten/verl/utils/reward_score/socialiqa.py.
  • Rules:
    • Format Reward: +2 if valid, -2 if invalid.
    • Answer Reward: +2 if correct, -2 if incorrect, -3 if invalid.
    • Language consistency and other auxiliary rewards: not applied.
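
The rules above could be implemented roughly as follows. This is a minimal toy version assuming an R1-Zero-style `<think>...</think><answer>...</answer>` output template; the function name and regex are illustrative, and the actual implementation lives in DeepEnlighten/verl/utils/reward_score/socialiqa.py:

```python
import re

def compute_score(solution_str: str, ground_truth: str) -> float:
    """Toy scorer mirroring the stated rules; NOT the actual
    DeepEnlighten/verl/utils/reward_score/socialiqa.py code."""
    # Format reward: +2 if the completion follows the expected
    # <think>...</think><answer>...</answer> template, -2 otherwise.
    match = re.search(r"<think>.*?</think>\s*<answer>(.*?)</answer>",
                      solution_str, re.DOTALL)
    format_reward = 2.0 if match else -2.0

    # Answer reward: +2 correct, -2 incorrect, -3 if no answer is extractable.
    if match is None:
        answer_reward = -3.0
    elif match.group(1).strip() == ground_truth.strip():
        answer_reward = 2.0
    else:
        answer_reward = -2.0

    return format_reward + answer_reward

print(compute_score("<think>lean in</think><answer>A</answer>", "A"))  # 4.0
```

Under this scheme a well-formatted correct answer scores +4, a well-formatted wrong answer 0, and an unparseable completion -5.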

Training

After configuring your WandB, GPUs, and other settings, execute the training:

```bash
bash run_rl_trainer_xxx.sh
```
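
For orientation, a verl training script of this kind typically boils down to a Hydra-style override invocation. The flag values below (model path, data files, epoch count) are illustrative placeholders, not the repo's actual settings, which live in the `run_rl_trainer_xxx.sh` scripts:

```shell
# Illustrative verl-style launch; values are placeholders, see the
# repo's run_rl_trainer_xxx.sh for the real configuration.
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=reinforce_plus_plus \
    data.train_files=data/social_iqa_train.parquet \
    data.val_files=data/social_iqa_val.parquet \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-3B \
    trainer.logger=['console','wandb'] \
    trainer.total_epochs=1
```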

Key Findings

For details, refer to:

  • DeepEnlighten Training Report
  • analysis directory: Contains log analysis of CoT, language mixing, and "aha moment".
  • evaluation directory: Contains evaluation results on math benchmarks.

1. Generalization from EQ to IQ

  • Social reasoning can generalize to out-of-distribution (OOD) tasks requiring mathematical reasoning.

Table: Accuracy in Mathematical Reasoning CoT Tests

(Base Model = Llama3.2-3B-Instruct, 1000 Steps RL, Number of Samples in Parentheses)

| Task | DeepEnlighten-3B | Llama3.2-3B-Instruct |
| --- | --- | --- |
| math-cot-test | 0.4419 (3750) | 0.2672 (3750) |
| cmath-cot-test | 0.5995 (824) | 0.5480 (823) |
| gsm8k-cot-test | 0.7576 (330) | 0.7660 (329) |

2. Longer CoT and Overthinking Phenomenon

  • Longer CoT does not consistently appear across different experiments.
  • Longer CoT likely emerges only when the task is challenging enough; on easier tasks the model may fall back on memorization rather than genuine reasoning.
  • Llama-Instruct as a base model tends to overthink in social reasoning, whereas a related paper suggests that Llama-Instruct is the least likely to overthink in math reasoning.
  • Further experiments are required to validate these observations.

3. Longer CoT ≠ Higher EQ

  • Although CoT length and mean reward both increase during training, longer CoT does not correlate with higher answer accuracy.
  • This aligns with superficial self-reflection findings from OAT-ZERO.

Figures (Base Model = Llama3.2-3B-Instruct):

  • Left Figure: Answer accuracy versus token count distribution.
  • Right Figure: Regression analysis of accuracy against token count.
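
The kind of analysis behind these figures can be sketched as follows: bucket responses by CoT token count and compute per-bucket accuracy, which is what would reveal a flat (or falling) accuracy curve as CoT grows. The function and the data below are made up for illustration, not the repo's analysis code:

```python
# Illustrative sketch (with made-up data) of accuracy-vs-length analysis.
def accuracy_by_length(records, bucket_size=100):
    """records: list of (token_count, correct) pairs.
    Returns {bucket_start_tokens: accuracy} for non-empty buckets."""
    buckets = {}
    for tokens, correct in records:
        b = tokens // bucket_size
        hits, total = buckets.get(b, (0, 0))
        buckets[b] = (hits + int(correct), total + 1)
    return {b * bucket_size: hits / total
            for b, (hits, total) in sorted(buckets.items())}

fake = [(120, True), (130, False), (260, True),
        (270, False), (480, False), (490, False)]
print(accuracy_by_length(fake))  # {100: 0.5, 200: 0.5, 400: 0.0}
```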

4. Language Mixing Does Exist

  • While language mixing is observed, it is not prevalent.
  • Example: "购买电影票是娱乐的行为,是一种人性性行为,反映了人 Seekingjoy, pleasure and entertainment's需要。" (roughly: "Buying a movie ticket is an entertainment activity, a human behavior, reflecting a person's need for seeking joy, pleasure and entertainment.")

Table: Language Distribution in Model Thinking

(Base Model = Llama3.2-3B-Instruct)

| Category | Count | Percentage |
| --- | --- | --- |
| Only English | 96674 | 98.23% |
| Only Chinese | 0 | 0.00% |
| Mixed (English & Chinese) | 1727 | 1.75% |
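
A table like this can be built with a rough Unicode-range check per thinking trace; the classifier below is a simplified sketch of that idea (the repo's actual analysis is in the analysis directory):

```python
def classify_language(text: str) -> str:
    # Rough classification by character ranges: ASCII letters for English,
    # the CJK Unified Ideographs block (U+4E00-U+9FFF) for Chinese.
    has_en = any("a" <= ch.lower() <= "z" for ch in text)
    has_zh = any("\u4e00" <= ch <= "\u9fff" for ch in text)
    if has_en and has_zh:
        return "mixed"
    if has_zh:
        return "zh"
    if has_en:
        return "en"
    return "other"

print(classify_language("购买电影票是娱乐的行为"))            # zh
print(classify_language("Buying a ticket is entertainment"))  # en
print(classify_language("反映了人 Seeking joy 的需要"))        # mixed
```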

Acknowledgements

This project builds upon and references several open-source works, including verl, REINFORCE++, Social IQa, DeepSeek-R1, and DeepSeek-Math.
