Paper Links
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Environment
pip install math_verify # reward function
pip install "trl>=0.15"
Note: It is normal for the loss to approach zero during training. Refer to this issue for more details.
A reward function takes the text completions generated by a model, along with other columns from the dataset, as parameters and scores the generated text. The example below shows how to implement a simple length-based reward function: it gives a reward of 1.0 if the generated text is longer than 1024 characters, and 0.0 otherwise.
from swift.plugin.orm import ORM, orms

class DummyLengthRewardFunction(ORM):
    def __call__(self, completions, **kwargs):
        # reward 1.0 for completions longer than 1024 characters, otherwise 0.0
        return [1.0 if len(completion) > 1024 else 0.0 for completion in completions]

orms['dummy'] = DummyLengthRewardFunction
You can add this reward function in swift/examples/train/grpo/plugin/plugin.py, register it with the parameter --external_plugins examples/train/grpo/plugin/plugin.py, and then select it via the reward_funcs parameter. For an example of how to execute the script, refer to here.
Swift provides four rule-based reward functions built into the system: accuracy, format, cosine, and repetition. (The code can be found in swift/plugin/orm.py.)
The accuracy and format reward functions are based on the paper DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, while the cosine and repetition reward functions are derived from the paper Demystifying Long Chain-of-Thought Reasoning in LLMs.
- accuracy
This function compares the model's generated result with the solution column in the dataset to calculate an accuracy score. If the generated result matches the standard answer, the score is 1.0; otherwise, it is 0.0.
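For illustration, below is a minimal sketch of an accuracy-style reward function built on the math_verify package installed above. The function name, signature, and error handling are assumptions for this example; the actual built-in implementation lives in swift/plugin/orm.py.

from math_verify import parse, verify

# Sketch: compare each completion against the dataset's `solution` column.
def accuracy_reward(completions, solution, **kwargs):
    rewards = []
    for completion, sol in zip(completions, solution):
        try:
            # 1.0 if the parsed answers are mathematically equivalent, else 0.0
            rewards.append(1.0 if verify(parse(sol), parse(completion)) else 0.0)
        except Exception:
            # unparsable output is treated as incorrect
            rewards.append(0.0)
    return rewards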
- format
The paper uses the following system prompt to enforce a fixed format for model responses:
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively.
This function checks whether the model generates text in the format <think>think content</think><answer>answer content</answer>. If the generated text adheres to the format requirement, the score is 1.0; otherwise, it is 0.0.
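A format check along these lines can be written with a regular expression. The following is a hedged sketch rather than the exact built-in code; see swift/plugin/orm.py for the actual implementation.

import re

# Sketch of a format reward: 1.0 if the completion matches
# <think>...</think><answer>...</answer>, else 0.0.
think_answer_pattern = re.compile(r'^<think>.*?</think>\s*<answer>.*?</answer>$', re.DOTALL)

def format_reward(completions, **kwargs):
    return [1.0 if think_answer_pattern.match(c.strip()) else 0.0 for c in completions]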
- cosine
The paper found that training with only the accuracy reward function could lead to overly long generated sequences, affecting training performance. The cosine reward function optimizes the training process by controlling the length of the generated sequences:
- For text that generates the correct answer, the reward value decreases as the length increases, encouraging concise responses.
- For text that generates incorrect answers, the reward value increases as the length increases, encouraging deeper reasoning.
A cosine function is used to smoothly adjust the reward value, keeping the changes within a reasonable range. The parameters for the cosine function include the length of the generated text, the maximum length limit, and the minimum and maximum reward values. A minimal sketch of this interpolation follows the parameter list below.
Parameters:
- cosine_min_len_value_wrong (default: 0.0): Reward value corresponding to the minimum length when the answer is incorrect.
- cosine_max_len_value_wrong (default: -0.5): Reward value corresponding to the maximum length when the answer is incorrect.
- cosine_min_len_value_correct (default: 1.0): Reward value corresponding to the minimum length when the answer is correct.
- cosine_max_len_value_correct (default: 0.5): Reward value corresponding to the maximum length when the answer is correct.
- cosine_max_len (default: the model's maximum generation length): Maximum length limit for generated text.
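The sketch below shows one way to express the interpolation described above, using the parameter names listed; it is an illustrative assumption, not the exact code in swift/plugin/orm.py.

import math

# Sketch: pick the value range based on correctness, then interpolate between
# the min-length and max-length values with a cosine schedule over length.
def cosine_reward(gen_len, is_correct, cosine_max_len,
                  cosine_min_len_value_wrong=0.0, cosine_max_len_value_wrong=-0.5,
                  cosine_min_len_value_correct=1.0, cosine_max_len_value_correct=0.5):
    if is_correct:
        value_at_min_len = cosine_min_len_value_correct
        value_at_max_len = cosine_max_len_value_correct
    else:
        value_at_min_len = cosine_min_len_value_wrong
        value_at_max_len = cosine_max_len_value_wrong
    progress = min(gen_len / cosine_max_len, 1.0)
    # cos(pi * progress) goes from 1 at length 0 to -1 at cosine_max_len
    return value_at_max_len + 0.5 * (value_at_min_len - value_at_max_len) * (1.0 + math.cos(math.pi * progress))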
- repetition
This function penalizes repetition in generated text by detecting repeated n-gram patterns and assigning penalties based on the level of repetition.
The function splits the generated text into words and extracts n-grams of a specified size (default: 3-grams). It computes a repetition ratio from the proportion of unique n-grams among all n-grams; if the proportion of repeated n-grams is high, a significant negative reward (penalty) is applied. The penalty is computed from the repetition ratio and a maximum penalty value (default: -1.0); a sketch of this computation follows the parameter list below.
Parameters:
- repetition_n_grams (default: 3): Size of the n-gram used to detect repetition.
- repetition_max_penalty (default: -1.0): Maximum penalty value, which controls the intensity of the penalty.
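As a concrete illustration of the computation described above, here is a hedged sketch using the two parameters just listed; the built-in function in swift/plugin/orm.py may differ in detail.

# Sketch: the lower the share of unique n-grams, the closer the reward
# moves toward the maximum penalty.
def repetition_reward(completion, repetition_n_grams=3, repetition_max_penalty=-1.0):
    words = completion.split()
    if len(words) < repetition_n_grams:
        return 0.0
    ngrams = [tuple(words[i:i + repetition_n_grams])
              for i in range(len(words) - repetition_n_grams + 1)]
    unique_ratio = len(set(ngrams)) / len(ngrams)
    return (1.0 - unique_ratio) * repetition_max_penalty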
- Reward Models
In addition to rule-based reward functions, this framework also supports using a reward model as the reward function. When using a reward model, specify the reward_model parameter (analogous to the model parameter) with the path or name of the reward model. Note that at least one of reward_model and reward_funcs must be specified.
Hyperparameters
- num_generations: The number of samples generated for each prompt, referred to as G in the paper. It needs to be divisible by per_device_eval_batch_size * nproc_per_node.
- max_completion_length: The maximum length for sampling generation, default is 512.
- ds3_gather_for_generation: This parameter applies to DeepSpeed ZeRO-3. If enabled, the policy model weights are gathered for generation, improving generation speed. However, disabling this option allows training models that exceed the VRAM capacity of a single GPU, albeit at the cost of slower generation. Disabling this option is not compatible with vLLM generation. The default is True.
- reward_funcs: Reward functions used to score the results generated by the model. Includes the built-in accuracy, format, cosine, and repetition rule-based functions, detailed in swift/plugin/orm.py.
- reward_weights: Weights for each reward function. Must match the number of reward functions. If None, all rewards are weighted equally with a weight of 1.0. (A sketch of the weighted combination appears after this list.)
  - Note: If --reward_model is included in GRPO training, it is added to the end of the reward functions.
- log_completions: Whether to log the model-generated content during training, to be used in conjunction with --report_to wandb; default is False.
  - Note: If --report_to wandb is not set, a completions.jsonl will be created in the checkpoint directory to store the generated content.
- use_vllm: Whether to use vLLM as the backend for sampling generation. Default is False; enabling it is recommended to speed up training.
- vllm_device: Device for deploying vLLM, default is auto, meaning the first unused GPU. Use cuda:x to specify a particular card.
- vllm_gpu_memory_utilization: vLLM pass-through parameter.
- vllm_max_model_len: vLLM pass-through parameter.
- reward_model: Specified in the same way as model; uses a reward model as a reward function. At least one of reward_funcs and reward_model needs to be specified.
The hyperparameters for the reward function can be found in the Built-in Reward Functions section.
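To make the reward_weights semantics concrete, the sketch below shows how per-function scores could be combined into a single reward per completion: a weighted sum with equal weights of 1.0 when reward_weights is None, and the reward model's scores appended after the rule-based functions. combine_rewards is a hypothetical helper for illustration; the actual aggregation happens inside the GRPO trainer.

def combine_rewards(per_function_rewards, reward_weights=None):
    # per_function_rewards: one list of scores per reward function (with the
    # reward model's scores, if --reward_model is set, appended as the last entry)
    if reward_weights is None:
        reward_weights = [1.0] * len(per_function_rewards)
    assert len(reward_weights) == len(per_function_rewards)
    num_completions = len(per_function_rewards[0])
    return [sum(w * scores[i] for w, scores in zip(reward_weights, per_function_rewards))
            for i in range(num_completions)]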
It is recommended to use vLLM for sampling. In a multi-GPU environment, it is advisable to set aside one GPU specifically for vLLM deployment, in which case the number of processes should be one less than the number of GPUs.
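The quick check below illustrates the num_generations constraint and the GPU split for the two example scripts that follow. check_num_generations is a hypothetical helper for illustration, not part of swift.

def check_num_generations(num_generations, per_device_eval_batch_size, nproc_per_node):
    # num_generations must be divisible by per_device_eval_batch_size * nproc_per_node
    return num_generations % (per_device_eval_batch_size * nproc_per_node) == 0

# Multi-GPU example: 8 GPUs, one reserved for vLLM -> nproc_per_node = 7
assert check_num_generations(7, 1, 7)
# Single-GPU example: nproc_per_node = 1
assert check_num_generations(4, 4, 1)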
Multi-GPU vLLM
# nproc_per_node is one less than the number of GPUs, with vLLM by default deployed on the last card, i.e., cuda:7
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=7 \
swift rlhf \
--rlhf_type grpo \
--model Qwen/Qwen2.5-7B-Instruct \
--reward_funcs accuracy format cosine repetition \
--use_vllm true \
--vllm_device auto \
--vllm_gpu_memory_utilization 0.7 \
--vllm_max_model_len 8192 \
--train_type full \
--torch_dtype bfloat16 \
--dataset 'AI-MO/NuminaMath-TIR#5000' \
--max_completion_length 2048 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--learning_rate 1e-6 \
--gradient_accumulation_steps 2 \
--eval_steps 200 \
--save_steps 200 \
--save_total_limit 2 \
--logging_steps 5 \
--max_length 4096 \
--output_dir output \
--warmup_ratio 0.05 \
--dataloader_num_workers 4 \
--dataset_num_proc 4 \
--num_generations 7 \
--temperature 0.9 \
--system 'examples/train/grpo/prompt.txt' \
--deepspeed zero2
Single-GPU
CUDA_VISIBLE_DEVICES=0 \
swift rlhf \
--rlhf_type grpo \
--model Qwen/Qwen2.5-7B-Instruct \
--reward_funcs accuracy format cosine repetition \
--train_type lora \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--torch_dtype bfloat16 \
--dataset 'AI-MO/NuminaMath-TIR#1000' \
--max_completion_length 1024 \
--num_train_epochs 1 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--learning_rate 1e-5 \
--gradient_accumulation_steps 1 \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 2 \
--logging_steps 5 \
--max_length 2048 \
--output_dir output \
--warmup_ratio 0.05 \
--dataloader_num_workers 4 \
--dataset_num_proc 4 \
--num_generations 4 \
--temperature 0.9 \
--system 'examples/train/grpo/prompt.txt'