Hello, I have a quick question.

I know most RLHF implementations use a KL-divergence penalty, as this code does:

https://github.com/nebuly-ai/nebullvm/blob/aad1c09ce20946294df3ec83569bad9496f58d0e/apps/accelerate/chatllama/chatllama/rlhf/trainer.py#L871-L877

However, in the InstructGPT paper (link), they actually divide the RL policy by the SFT policy, which corresponds to `action_log_prob - old_action_log_probs` in your code.
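For reference, the relevant term in the InstructGPT objective (Eq. 2 of the paper, with the pretraining term omitted) is:

```math
\operatorname{objective}(\phi) = \mathbb{E}_{(x,y) \sim D_{\pi_{\phi}^{\mathrm{RL}}}} \left[ r_{\theta}(x, y) - \beta \, \log \frac{\pi_{\phi}^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \right]
```

So the log-ratio is penalized, i.e. subtracted from the reward.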
I have also seen implementations that follow this convention, such as ColossalAI, but you seem to be doing the opposite. I understand that you negate the value because you are using an RL reward as a deep-learning loss. But even accounting for that, you are adding the KL term rather than subtracting it, which is still the opposite of `action_log_prob - old_action_log_probs`.
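To make the sign question concrete, here is a minimal sketch of the reward shaping I would expect from the paper. This is not your implementation; the function name, tensor names like `sft_log_probs`, and the `kl_coef` default are my own:

```python
import torch

def shape_rewards(
    rewards: torch.Tensor,           # (batch, seq_len) reward-model scores
    action_log_probs: torch.Tensor,  # (batch, seq_len) log pi_RL(a_t | s_t)
    sft_log_probs: torch.Tensor,     # (batch, seq_len) log pi_SFT(a_t | s_t)
    kl_coef: float = 0.1,            # beta in the InstructGPT objective
) -> torch.Tensor:
    # Per-token log-ratio log(pi_RL / pi_SFT); positive where the RL
    # policy has drifted toward tokens the SFT policy finds unlikely.
    kl = action_log_probs - sft_log_probs
    # InstructGPT convention: drift is penalized, i.e. reward - beta * KL.
    return rewards - kl_coef * kl
```

Flipping either the order of the subtraction or the sign of the penalty would reward divergence from the SFT policy instead of penalizing it, which is what your code appears to do.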
Do you have any special reason for this?
Thank you for reading :)
Hi @mountinyy, I think it's more coherent now, but we are still having some issues with RLHF.
We are digging deeper to find the root cause.
Thanks for the advice, you should already see the change in #306.