😷 Refactor GRPO/RLOO to isolate _generate#4114
Conversation
…_thw` in GRPO and RLOO trainers; update `split_pixel_values_by_grid` to use `image_grid_thw`
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
cc @Peter-Chou, customization is made easier with this one |
lewtun
left a comment
There was a problem hiding this comment.
LGTM once the tests pass and a question about whether we should split _generate from scoring entirely
|
@qgallouedec Yes. The original |
_generate_generate
_generate_generate
_generate_generate
_generate_generate
|
I'm quite interested in following the refactoring process, as the biggest problem for me is scaling to context sizes of up to 32K, and a lot of OOM bombs get hit right now. |
|
This refactoring is actually for enabling multi-turn RL (with tool calling) |
This PR belongs to a sequence of PR that aims to refactor the generation part of GRPO/RLOO to allow for easier customization and ultimately tool calling
Previous:
image_split_sizesin favour ofimage_grid_thw#4111Next:
_generatein GRPO/RLOO: list of ints instead of tensors #4146_generatein GRPO/RLOO: Useprompt_idsfrom generation #4152_generatein GRPO/RLOO: Rely on generator for prompt truncation #4153_generatein GRPO/RLOO: Moveforward_kwargsoutside generation method #4154_generatein GRPO/RLOO: Insert images in the prompt #4155