Policy Log Probs and Reference Log Probs differ at 1st iteration of DPO/RPO #227

Open
shengyangs opened this issue Jul 3, 2024 · 0 comments
Labels
bug Something isn't working

Describe the bug

In DPO and all its variants, the policy is initialized at the reference policy. Therefore, in the first iteration, the log probs from the policy and the log probs from the reference policy should be exactly the same.
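To spell out why this matters, here is the standard DPO objective as it is usually written. The notation (β, the chosen response y_w, the rejected response y_l) is an assumption for illustration and does not come from the code in this repository; variants may use a different outer function around the same log-ratio terms.

```latex
\mathcal{L}_{\mathrm{DPO}}
  = -\log \sigma\!\Big(
      \beta \big[ \log \pi_\theta(y_w \mid x) - \log \pi_{\mathrm{ref}}(y_w \mid x) \big]
    - \beta \big[ \log \pi_\theta(y_l \mid x) - \log \pi_{\mathrm{ref}}(y_l \mid x) \big]
    \Big)

% At iteration 0, \pi_\theta = \pi_{\mathrm{ref}}, so both bracketed log-ratios are 0:
\mathcal{L}_{\mathrm{DPO}} \big|_{\pi_\theta = \pi_{\mathrm{ref}}}
  = -\log \sigma(0) = \log 2 \approx 0.693
```

Under this form, any nonzero log-ratio at iteration 0 means the first loss value is no longer exactly log 2.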

However, I found that the log probs differ at the 1st iteration with TP=4, as shown in the screenshots below.

TP=4, DP=1: the log probs differ.
[Screenshot, 2024-07-02, 4:49:53 PM]

TP=2, DP=1: the log probs are exactly the same.
[Screenshot, 2024-07-02, 4:47:42 PM]

Steps/Code to reproduce bug

  1. Pick a model
  2. Set TP=4
  3. Print out the pi_logprobs and ref_logprobs at iteration=0
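For step 3, a small helper like the one below can make the comparison more precise than eyeballing printed values. This is only a sketch: `compare_logprobs` is a hypothetical helper, not part of this repository, and it assumes the two sets of log probs have already been gathered into plain Python lists of floats.

```python
def compare_logprobs(pi_logprobs, ref_logprobs):
    """Compare two equal-length sequences of per-sample log probs.

    Hypothetical helper for debugging; not part of the training code.
    """
    if len(pi_logprobs) != len(ref_logprobs):
        raise ValueError("pi_logprobs and ref_logprobs must have the same length")

    # Element-wise absolute differences between policy and reference.
    diffs = [abs(p - r) for p, r in zip(pi_logprobs, ref_logprobs)]

    # Index of the first entry that is not exactly equal, or None.
    first_mismatch = next((i for i, d in enumerate(diffs) if d != 0.0), None)

    return {
        "exactly_equal": first_mismatch is None,
        "max_abs_diff": max(diffs, default=0.0),
        "first_mismatch": first_mismatch,
    }


# Example with made-up values (not taken from the screenshots).
same = compare_logprobs([-1.25, -0.5], [-1.25, -0.5])
different = compare_logprobs([-1.25, -0.5], [-1.25, -0.625])
print(same)
print(different)
```

Reporting the maximum absolute difference alongside the exact-match flag helps distinguish a tiny numerical discrepancy from a larger mismatch.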

Expected behavior

Regardless of the GBS, MBS, TP, PP, DP, or Forward-MBS settings, the policy log probs and reference log probs should be exactly the same at the 1st iteration.
