Adds reward bootstrapping to PPOTrainer #1536
Conversation
Hi @ejmejm, thanks for raising this issue. This is indeed an interesting alternative design, and it makes intuitive sense. However, it might be better to use the EOS trick we found here. The idea is to replace the scores of completions that do not end with an EOS token with -1. In fact, the reward bootstrapping probably won't work because of how value networks are created. If the value network is randomly initialized, then the bootstrapped reward is a random score. If the value network is initialized from a reward model, the logits corresponding to the non-EOS tokens could even be misleading (for example, a trained RM could have mostly negative numbers for the non-EOS tokens). WDYT?
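For reference, the EOS trick described above could be sketched roughly like this (a minimal illustration, not TRL's actual implementation; `apply_eos_penalty` and its arguments are hypothetical names):

```python
import torch

def apply_eos_penalty(scores, completion_ids, eos_token_id, penalty=-1.0):
    """Replace the score of any completion that does not end with EOS.

    scores:         (batch,) reward-model scores, one per completion
    completion_ids: (batch, seq_len) generated token ids
    """
    # Check whether the last generated token of each completion is EOS.
    ends_with_eos = completion_ids[:, -1] == eos_token_id
    # Keep the original score only for completions that ended properly.
    return torch.where(ends_with_eos, scores, torch.full_like(scores, penalty))
```

This sidesteps the value-network initialization problem entirely: truncated completions get a fixed penalty rather than a possibly meaningless bootstrapped value.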
Related: #1540
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

The current return calculation for the PPO trainer assumes that every response terminates without truncation (https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py#L1159).
When an episode is truncated early, it is technically correct to either drop the episode or, more commonly, bootstrap the reward with the value of the final state (e.g.
`last_reward <- last_reward + gamma * value_of_final_state`). This change adds a `bootstrap_rewards` option to `PPOConfig` that enables bootstrapping when sequences do not end with an EOS token. An alternative way to implement this would be to have the user pass in a list of which responses were terminated early, but I felt the option that was simpler for the user would be better.
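The bootstrapping update described above could look roughly like this (an illustrative sketch only; `bootstrap_last_reward` and its argument names are hypothetical, not the PR's actual API):

```python
import torch

def bootstrap_last_reward(rewards, values, ends_with_eos, gamma=0.99):
    """Bootstrap the final-step reward for truncated episodes.

    rewards:       (batch, T) per-token rewards
    values:        (batch, T) value-network estimates
    ends_with_eos: (batch,) bool, True if the response ended with EOS
    """
    bootstrapped = rewards.clone()
    # For truncated responses (no EOS), add gamma * V(final state)
    # to the reward at the last timestep.
    bootstrapped[:, -1] += (~ends_with_eos).float() * gamma * values[:, -1]
    return bootstrapped
```

With this shape, episodes that terminate normally are left untouched, while truncated ones get the standard `r_T + gamma * V(s_T)` correction at the final step.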