Why does the loss start at 0 when I train GRPO, and then possibly increase? #239
Comments
Try this?
I have the same problem.
I have the same problem. 😭
It might not be that issue. In your case the format reward is 0, but your loss is not 0. What I find even stranger is that after 400 training steps the loss can still go up, so it feels like my run has turned into optimising for an increasing loss. I don't know whether it's the loss function, whether someone forgot to subtract from 1, or something else. This is my latest run; I haven't stopped it yet and want to watch a bit longer to see what happens.
Same problem.
I tried lowering the version of math-verify to 0.3.3 (it was originally 0.5.2), and only kept the format and accuracy rewards, but it still doesn't work; the loss is still 0 initially.
With the exception of the format reward, the other rewards are improving:
{'loss': 0.0013, 'grad_norm': 0.0053334906697273254, 'learning_rate': 1.7241379310344828e-05, 'completion_length': 789.725, 'rewards/accuracy_reward': 0.43671875, 'rewards/format_reward': 0.0, 'rewards/reasoning_steps_reward': 0.9627604238688946, 'rewards/cosine_scaled_reward': 0.171499810856767, 'reward': 1.5709789715707303, 'reward_std': 0.38927115853875877, 'kl': 0.03206634521484375, 'epoch': 0.09}
9%|▉ | 25/283 [8:04:04<82:58:09, 1157.71s/it]
I use the modified format reward from @hellen9527 in #235, but the loss is still strange:
{'loss': 0.0, 'grad_norm': 0.9771971106529236, 'learning_rate': 2.142857142857143e-06, 'rewards/accuracy_reward': 0.6280612092465162, 'rewards/format_reward': 0.0, 'rewards/reasoning_steps_reward': 0.29370747953653337, 'rewards/cosine_scaled_reward': 0.3368528074584901, 'reward': 1.2586214922368526, 'reward_std': 0.7501622267067433, 'completion_length': 806.3540603637696, 'kl': 0.0010107040405273437, 'epoch': 0.07}
{'loss': 0.0001, 'grad_norm': 0.4975500702857971, 'learning_rate': 2.999485987463336e-06, 'rewards/accuracy_reward': 0.7127550840377808, 'rewards/format_reward': 0.0010204081423580646, 'rewards/reasoning_steps_reward': 0.3047619042918086, 'rewards/cosine_scaled_reward': 0.4018472107127309, 'reward': 1.420384594798088, 'reward_std': 0.7048569574952126, 'completion_length': 756.1489646911621, 'kl': 0.003369712829589844, 'epoch': 0.11}
{'loss': 0.2135, 'grad_norm': 1.6367053985595703, 'learning_rate': 2.981532510892707e-06, 'rewards/accuracy_reward': 0.7147959008812904, 'rewards/format_reward': 0.0, 'rewards/reasoning_steps_reward': 0.3149659845978022, 'rewards/cosine_scaled_reward': 0.4177613776177168, 'reward': 1.4475232884287834, 'reward_std': 0.6804138027131558, 'completion_length': 805.1270263671875, 'kl': 5.331846427917481, 'epoch': 0.15}
{'loss': 0.0003, 'grad_norm': 0.5887525677680969, 'learning_rate': 2.9382296023022897e-06, 'rewards/accuracy_reward': 0.7341836579144001, 'rewards/format_reward': 0.0, 'rewards/reasoning_steps_reward': 0.3403061218559742, 'rewards/cosine_scaled_reward': 0.4471289239823818, 'reward': 1.521618703007698, 'reward_std': 0.6430120587348938, 'completion_length': 785.0775329589844, 'kl': 0.006349372863769531, 'epoch': 0.19}
If you are using the GRPO trainer, the old policy is in effect updated every step, which means you just use a detached version of the current policy. The resulting probability ratio is therefore always 1. By definition of GRPO, the advantage is standardised, so its expectation is 0. Hence the expectation of the probability ratio multiplied by the advantage is also always zero. Although the loss is zero, there are still gradients in this case. The increase may come from the KL term growing as you move away from the original distribution.
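To make this concrete, here is a minimal sketch (made-up tensor values, not the actual TRL implementation) of why the policy term of the loss is numerically zero yet still carries a gradient when the "old" log-probabilities are just a detached copy of the current ones:

```python
import torch

# Current per-token log-probabilities of a sampled completion (made-up values).
per_token_logps = torch.tensor([-1.2, -0.7, -2.3], requires_grad=True)

# With one optimisation step per generation, the "old" log-probs are simply a
# detached copy of the current ones, so the ratio is exactly 1 in value...
old_per_token_logps = per_token_logps.detach()
ratio = torch.exp(per_token_logps - old_per_token_logps)  # tensor([1., 1., 1.])

# ...and the group-standardised advantages average to zero.
advantages = torch.tensor([0.8, -0.3, -0.5])

policy_loss = -(ratio * advantages).mean()
policy_loss.backward()

print(policy_loss.item())    # 0.0 (up to sign / floating-point error)
print(per_token_logps.grad)  # non-zero: -advantage / num_tokens per token
```

Because the gradient of the ratio at 1 is simply 1, the parameter update is still driven by the advantages even though the reported loss value is zero.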
How should I modify it? I saw an explanation saying that the ratio is kept at 1 on purpose, so that the loss reduction is fully attributed to the advantage function. I'm using the latest main branch of trl, installed via python setup.py install. What should I change?
After switching to the latest training script, I noticed that the initial loss is still 0, but both format_reward and accuracy_reward are slowly increasing. It seems these two are normal, but why is the loss still abnormal? Or does the loss log not matter at all?
It is completely normal for the loss to start at zero and then increase. Here's why.

The first thing is to understand the GRPO objective, which is formulated as follows:

$$\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left[\frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\big[\pi_\theta(o_{i,t}\mid q, o_{i,<t})\big]_{\text{no grad}}}\,\hat{A}_{i,t} \;-\; \beta\,\mathbb{D}_{\text{KL}}\big[\pi_\theta \,\|\, \pi_{\text{ref}}\big]\right]$$

where $G$ is the number of completions sampled per prompt, $o_i$ is the $i$-th completion, $\hat{A}_{i,t}$ is the group-standardised advantage, $\beta$ is the KL coefficient, and $\pi_{\text{ref}}$ is the reference policy.

Note: here, what interests us is the value of the loss itself, so the parts that only matter for the gradient can be ignored. To simplify, let's assume we only perform one exploration step per iteration (which is the standard implementation of GRPO). Consequently, the old policy is just a detached copy of the current policy and the probability ratio is exactly 1 in value.

Remember that the advantage does not depend on $\theta$ and, being standardised within each group, averages to 0. The policy term therefore contributes nothing to the value of the loss, and only the KL term remains:

$$\mathcal{L}_{\text{GRPO}}(\theta) \;\approx\; \beta \cdot \frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \mathbb{D}_{\text{KL}}\big[\pi_\theta \,\|\, \pi_{\text{ref}}\big]$$

In other words, in absolute terms, the loss is equal to the average KL divergence multiplied by $\beta$. Since the policy and the reference policy are initially equal, the loss starts at zero. Training then causes the policy to diverge from the reference policy, which is why the loss increases. Finally, this is entirely consistent with the equations. 🤗
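As a small numerical illustration of the point above (assumed log-probabilities and an assumed KL coefficient, not values from a real run), a k3-style per-token KL estimate, exp(ref − cur) − (ref − cur) − 1, is exactly zero while the policy still matches the reference and turns positive once it drifts, so the reported loss starts at zero and then grows:

```python
import torch

# Reference-policy log-probs for a few tokens (made-up values for illustration).
ref_logps = torch.tensor([-1.5, -0.9, -2.1])
beta = 0.04  # assumed KL coefficient

for step, cur_logps in enumerate([
    ref_logps.clone(),                           # start: policy == reference
    ref_logps + torch.tensor([0.3, -0.2, 0.1]),  # later: policy has drifted
]):
    diff = ref_logps - cur_logps
    per_token_kl = torch.exp(diff) - diff - 1    # k3 KL estimator, always >= 0
    loss = beta * per_token_kl.mean()            # policy term contributes 0 in value
    print(f"step {step}: loss = {loss.item():.6f}")

# step 0: loss = 0.000000
# step 1: loss is small but positive, and it keeps growing as the policy
#         moves further away from the reference.
```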
...
@qgallouedec Thank you so much! I was even wondering if there was something wrong with my environment setup or code version. If this is normal, that's great! It means I can focus more on the actual training results.
@qgallouedec, thanks for your clear explanation. However, I still wonder why GRPO only performs one exploration step per iteration, which is different from PPO. Could you elaborate on that? Furthermore, referring to Algorithm 1 in the DeepSeekMath paper, which loop exactly do you mean by "exploration step"? Is it the outermost iteration, the step loop, or the inner GRPO iteration loop?
I am using the distill-1.5b model, and since I only have 4 L20 GPUs, I modified some parameters and am still training the GRPO model on the NuminaMath-TIR dataset. However, I noticed that the loss remains 0, and I'm not sure where the configuration went wrong. I have ensured that the software versions match those in the setup.py file, and I also updated TRL and transformers to the latest version of the main branch. The specific logs and training configuration are as follows. I would like to know if this is normal and how to fix it.
train config:
train log: