
[Chatllama] KL Divergence equation #298

Open
mountinyy opened this issue Mar 25, 2023 · 3 comments

Comments

@mountinyy
Contributor

mountinyy commented Mar 25, 2023

Hello, I have a quick question.
I know most RLHF implementations use a KL divergence penalty.

https://github.com/nebuly-ai/nebullvm/blob/aad1c09ce20946294df3ec83569bad9496f58d0e/apps/accelerate/chatllama/chatllama/rlhf/trainer.py#L871-L877

However, in the InstructGPT paper (link), the KL term divides (policy for RL) by (policy for SFT), which corresponds to action_log_prob - old_action_log_probs in your code.
I have also seen codebases that follow this convention, such as
ColossalAI


But it seems like you are doing the opposite.
I understand that you negate the reward to turn it into a loss for deep learning.
But even considering that, you are adding the KL value rather than subtracting it, which means the sign is still the opposite of action_log_prob - old_action_log_probs.
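
For reference, here is a minimal sketch of the per-token KL penalty as described in InstructGPT; the helper name and the `beta` value are just placeholders for illustration, not from this repo:

```python
import torch

# Minimal sketch (not this repo's API): per-token KL-penalized reward as in
# the InstructGPT formulation. Names and the beta value are illustrative only.
def kl_penalized_reward(action_log_probs: torch.Tensor,
                        sft_log_probs: torch.Tensor,
                        reward: torch.Tensor,
                        beta: float = 0.02) -> torch.Tensor:
    # KL estimate per token: log pi_RL(a|s) - log pi_SFT(a|s)
    # (i.e. action_log_prob - old_action_log_probs in the trainer's terms)
    kl = action_log_probs - sft_log_probs
    # The penalty is *subtracted* from the reward, so drifting away from the
    # SFT policy lowers the reward that PPO maximizes.
    return reward - beta * kl
```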

Do you have any special reason for this?

Thank you for reading :)

@PierpaoloSorbellini
Collaborator

Hi @mountinyy, thanks for reaching out!
Yep, I would agree with you on the matter.
We are going to test out the changes and get back to you!

@mountinyy
Contributor Author

@PierpaoloSorbellini Thank you for the reply! Please let me know the result :)

@PierpaoloSorbellini
Collaborator

Hi @mountinyy, I think it's more coherent now, but we are still having some issues with RLHF.
We are digging deeper to find the root cause.
Thanks for the advice; you should already see the change in #306.
