
DeepSeek R1 Implementation

Motivation

I wanted to recreate DeepSeek R1's results at a smaller scale, focusing on understanding the core mechanics by implementing everything from scratch. This repo trains a Qwen 1.5B model on the GSM8K grade-school math dataset.

This implementation heavily borrows from Will Brown's work (@willccbb), but restructures the code into a format optimized for learning and experimentation.

The key differences in my implementation are that the GRPO loss is computed directly rather than through external RL libraries, and that the code is restructured into a multi-script repository.

I hope this helps others understand the method better, and provides an easier way to try out smaller-scale ideas.

Installation

pip install -r requirements.txt

Authentication with Hugging Face is required:

export HUGGINGFACE_TOKEN="your-token-here"
huggingface-cli login

Implementation Details

The system consists of several key modules:

main.py

Contains the core training loop implementing GRPO (Group Relative Policy Optimization). Handles model training, evaluation, and metric tracking.
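
As a rough sketch of what computing the GRPO loss directly looks like, here is a minimal version of the group-relative advantage and clipped policy-gradient objective. The function and tensor names are illustrative, not the repo's own, and the full objective also includes a KL penalty against a frozen reference model, omitted here for brevity:

import torch

def grpo_loss(logprobs, old_logprobs, rewards, eps=0.2):
    # One group of G completions sampled for a single prompt.
    # logprobs:     (G, T) per-token log-probs under the current policy
    # old_logprobs: (G, T) per-token log-probs under the sampling policy
    # rewards:      (G,)   scalar reward per completion

    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    adv = adv.unsqueeze(1)  # broadcast over the token dimension

    # PPO-style clipped probability ratio.
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    per_token = torch.min(ratio * adv, clipped * adv)

    # Maximize the objective, i.e. minimize its negative mean.
    return -per_token.mean()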

llms.py

Manages model loading and configuration, currently supporting LLaMA + Qwen models through Hugging Face's transformers library. Designed to be easily extensible to other model architectures.
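
As a point of reference, loading a model and tokenizer through transformers usually looks like the sketch below; the model ID and dtype/device options shown are assumptions for illustration, not necessarily what llms.py uses:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model(model_name="Qwen/Qwen2.5-1.5B-Instruct"):
    # Load a causal LM and its tokenizer from the Hugging Face Hub.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,  # half precision to fit on one GPU
        device_map="auto",           # requires the accelerate package
    )
    return model, tokenizer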

rldatasets.py

Handles dataset loading and preprocessing, currently focused on GSM8K math problems. Implements custom data loaders for both training and evaluation.
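
GSM8K is hosted on the Hugging Face Hub, so a loader along these lines is typical; this is a generic sketch rather than the repo's exact preprocessing:

from datasets import load_dataset

def build_gsm8k(split="train"):
    # GSM8K answers end with '#### <number>'; extract that final
    # number as the ground-truth target for the reward function.
    ds = load_dataset("openai/gsm8k", "main", split=split)
    for row in ds:
        question = row["question"]
        answer = row["answer"].split("####")[-1].strip()
        yield question, answer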

evaluator.py

Contains evaluation metrics and reward functions, closely following DeepSeek's original implementation.
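
In the R1 recipe the reward combines an answer-correctness check with a check that the completion follows the <think>/<answer> format. A simplified version might look like the following; the tag names match DeepSeek's setup, but the reward weights are assumptions:

import re

def correctness_reward(completion, target):
    # Full reward if the extracted answer matches the ground truth.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == target:
        return 2.0
    return 0.0

def format_reward(completion):
    # Small reward for emitting well-formed <think>/<answer> tags.
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 0.5 if re.search(pattern, completion, re.DOTALL) else 0.0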

Results

Training was conducted on a single H100 GPU. After ~400 training steps:

[Training results plot]

Results on the validation set show a clearer sign of learning:

[Evaluation results plot]

Future Directions

I'm really pleased to see how well the key mechanics work even in this simplified implementation. Building on this, I am very excited about several directions:

  1. Adding self-play capabilities where agents compete and learn from each other using relative rewards. This would create a more dynamic training environment where the reward signal comes from agent interactions rather than fixed metrics.

  2. Implementing soft reward structures, particularly for complex reasoning tasks. I've written a framework for AI debate that I'm excited to try out.

  3. Expanding into vision-language models (VLMs) to improve world modeling capabilities. I have an idea about using R1-style training to enhance how VLMs build and maintain internal world models, and I'm really excited to explore it. (If anyone else is interested, I would love to talk.)

  4. I'd like to run all of this experimentation within this framework, so I need to make things faster and add multi-GPU training support.

Collaboration?

I have many more experiments I'd love to run but am severely compute-limited. If you're an organization with available compute resources and interest in exploring these directions, I'd be very excited to collaborate! Please reach out to discuss potential experiments. I can be reached at [email protected] or on Twitter @brendanh0gan. :)
