Adds reward bootstrapping to PPOTrainer #1536
Conversation
Hi @ejmejm, thanks for raising this issue. This is indeed an interesting alternative design, and it makes intuitive sense. However, it might be better to use the EOS trick we found here. The idea is to replace the scores of completions that do not end with an EOS token with -1. In fact, the reward bootstrapping probably won't work because of how value networks are created. If the value network is randomly initialized, then the bootstrapped reward is a random score. If the value network is initialized from a reward model, the logits corresponding to the non-EOS tokens could even be misleading (for example, a trained RM could have mostly negative numbers for the non-EOS tokens). WDYT?
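For reference, the EOS trick described above could be sketched roughly like this (a minimal illustration, not TRL's actual implementation; `apply_eos_penalty` and its arguments are hypothetical names):

```python
import torch

def apply_eos_penalty(scores, completion_ids, eos_token_id, penalty=-1.0):
    """Replace the score of any completion that does not end with EOS.

    scores:         (batch,) reward-model scores, one per completion
    completion_ids: (batch, seq_len) generated token ids
    """
    # Check whether the last generated token of each completion is EOS.
    ends_with_eos = completion_ids[:, -1] == eos_token_id
    # Keep the original score only for completions that ended properly.
    return torch.where(ends_with_eos, scores, torch.full_like(scores, penalty))
```

This sidesteps the value-network initialization problem entirely: truncated completions get a fixed penalty rather than a possibly meaningless bootstrapped value.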
Related: #1540
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

The current return calculation for the PPO trainer assumes that every response terminates without truncation (https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py#L1159).
When an episode is truncated early, it is technically correct to either drop the episode or, more commonly, bootstrap the reward with the value of the final state (e.g.
`last_reward <- last_reward + gamma * value_of_final_state`). This change adds a `bootstrap_rewards` option to `PPOConfig` that enables bootstrapping when sequences do not end with an EOS token. An alternative way to implement this would be to have the user pass in a list of which responses were terminated early, but I felt the option that was simpler for the user would be better.
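The bootstrapping update described above could look roughly like this (an illustrative sketch only; `bootstrap_last_reward` and its argument names are hypothetical, not the PR's actual API):

```python
import torch

def bootstrap_last_reward(rewards, values, ends_with_eos, gamma=0.99):
    """Bootstrap the final-step reward for truncated episodes.

    rewards:       (batch, T) per-token rewards
    values:        (batch, T) value-network estimates
    ends_with_eos: (batch,) bool, True if the response ended with EOS
    """
    bootstrapped = rewards.clone()
    # For truncated responses (no EOS), add gamma * V(final state)
    # to the reward at the last timestep.
    bootstrapped[:, -1] += (~ends_with_eos).float() * gamma * values[:, -1]
    return bootstrapped
```

With this shape, episodes that terminate normally are left untouched, while truncated ones get the standard `r_T + gamma * V(s_T)` correction at the final step.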