Skip to content

Comments

Adds reward bootstrapping to PPOTrainer#1536

Closed
ejmejm wants to merge 1 commit intohuggingface:mainfrom
ejmejm:ppo-boostrapping
Closed

Adds reward bootstrapping to PPOTrainer#1536
ejmejm wants to merge 1 commit intohuggingface:mainfrom
ejmejm:ppo-boostrapping

Conversation

@ejmejm
Copy link
Contributor

@ejmejm ejmejm commented Apr 13, 2024

The current return calculations for the PPO trainer assumes that every response terminates without truncation (https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py#L1159).

nextvalues = values[:, t + 1] if t < gen_len - 1 else 0.0

When an episode is truncated early, it is technically correct to either drop the episode, or more commonly, bootstrap the reward with the value of the final state (e.g. last reward <- last reward + gamma * value of final state). This change adds a bootstrap_rewards option to PPOConfig that enables bootstrapping when sequences do not end with an EOS token. An alternative way to implement this is to have the user pass in a list of which responses were terminated early, but I felt the option that was simpler for the user would be better.

@vwxyzjn
Copy link
Contributor

vwxyzjn commented Apr 17, 2024

Hi @ejmejm, Thanks for raising this issue. This is indeed an interesting alternative design, and it makes intuitive sense. However, it might be better to use the EOS trick we found in here. The idea is to replace the scores of completions that do not end with EOS tokens with -1.

In fact, the reward bootstrapping prob won't work because of how value networks are created. If the value network is randomly initialized, then the bootstrapped reward is a random score. If the value network is initialized from a reward model, the logits corresponding to the non-EOS tokens could be even misleading (for example, trained RM could have mostly negative numbers for the non-EOS tokens). WDYT?

image

@vwxyzjn
Copy link
Contributor

vwxyzjn commented Apr 17, 2024

Related #1540

@github-actions
Copy link
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@github-actions github-actions bot closed this May 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants