Update wordle.py example with masking of env tokens#4895
Update wordle.py example with masking of env tokens#4895sergiopaniego merged 18 commits intomainfrom
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
trl/trainer/grpo_trainer.py
Outdated
| else: | ||
| tool_mask = None | ||
|
|
||
| env_mask = extra_fields.pop("env_mask", None) |
There was a problem hiding this comment.
would something like this work?:
tool_mask = extra_fields.pop("env_mask", None)these tokens are basically processed similarly, maybe it can simplify things
|
I've reviewed the rest of the scripts+notebooks using OpenEnv and they seem to be ok since they add the env feedback to the prompt. I still need to upload the updated Wordle notebook (wip) |
|
@albertvillanova moved everything related to vLLM in a separate class in #4700, you'll have a to deal with some conflicts 😬 |
Thanks for the ping, @qgallouedec! @sergiopaniego, since I am familiar with the changes introduced in #4700, I can take care of resolving the conflicts in this PR if that helps. 🤗 |
|
I'd really appreciate some help with that @albertvillanova!! 😄 |
|
Done! |
albertvillanova
left a comment
There was a problem hiding this comment.
Thanks! Just some comments below.
examples/scripts/openenv/wordle.py
Outdated
| report_to="trackio", | ||
| log_completions=True, | ||
| report_to="wandb", |
There was a problem hiding this comment.
Why the change from trackio to wandb?
trl/experimental/openenv/utils.py
Outdated
| return sampling_params | ||
|
|
||
|
|
||
| def _build_server_generation_kwargs( |
There was a problem hiding this comment.
This new _build_server_generation_kwargs function duplicates logic from _build_colocate_sampling_params. Maybe we could refactor these to reduce duplication.
There was a problem hiding this comment.
As privately discussed, please note that
- Trainer no longer has
structured_output_regexattribute - Instead, it is an attribute of
VLLMGeneration
So, in these code lines:
trl/trl/experimental/openenv/utils.py
Lines 35 to 36 in 25c5d51
maybe better using:
if trainer.vllm_generation.structured_outputs_regex:
structured_outputs = StructuredOutputsParams(regex=trainer.vllm_generation.structured_outputs_regex)Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
There was a problem hiding this comment.
Everything related to GRPO and vLLM lgtm! nice work @sergiopaniego!
|
Results using trackio in with the latest changes: https://huggingface.co/spaces/sergiopaniego/Wordle-GRPO TODO: Upload the updated wordle notebook |
What does this PR do?
Some results:
The
position_rewardcould be renamed to something likepartial_reward.One idea:
The model is Qwen3-1.7B and it achieves a correct reward of around 0.3. This does not mean that it answers correctly 30% of the time, but rather that, for a given answer, about 30% of the letters are correct—rather than winning 30% of the games.
Before submitting
Pull Request section?
to it if that's the case.
Who can review?
@kashif @qgallouedec