Update wordle.py example with masking of env tokens#4895

Merged
sergiopaniego merged 18 commits into main from updated-wordle
Feb 2, 2026

@sergiopaniego
Member

What does this PR do?

Some results:

[Four result screenshots, captured 2026-01-26; images not preserved]

The position_reward could be renamed to something like partial_reward.

One idea:
The model is Qwen3-1.7B and it achieves a correct reward of around 0.3. This does not mean that it wins 30% of the games, but rather that, for a given answer, about 30% of the letters are correct.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

@kashif @qgallouedec

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

else:
    tool_mask = None

env_mask = extra_fields.pop("env_mask", None)

would something like this work?:

tool_mask = extra_fields.pop("env_mask", None)

these tokens are basically processed similarly, maybe it can simplify things
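
To make the suggestion concrete, here is a minimal sketch of what masking environment tokens means for the loss. It is not TRL's implementation: the helper `masked_mean` and the toy values are hypothetical, and only the `env_mask` key and the `extra_fields.pop("env_mask", None)` call come from the diff above. The sketch assumes that a mask value of 1 marks a token generated by the model and 0 marks a token injected by the environment (for example, Wordle feedback), so that only model tokens contribute to the averaged loss.

```python
def masked_mean(per_token_loss, mask):
    """Average per-token losses over positions where mask == 1 (hypothetical helper)."""
    kept = [loss for loss, keep in zip(per_token_loss, mask) if keep]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)


# Toy rollout output; only the "env_mask" key comes from the diff.
extra_fields = {"env_mask": [1, 1, 0, 0, 1]}
per_token_loss = [0.5, 1.5, 9.0, 9.0, 1.0]

# Reviewer's suggestion: reuse the existing tool_mask slot for env tokens.
tool_mask = extra_fields.pop("env_mask", None)

# Assumption: with no mask available, every token counts.
mask = tool_mask if tool_mask is not None else [1] * len(per_token_loss)
loss = masked_mean(per_token_loss, mask)
```

In this toy case the two environment tokens with loss 9.0 are ignored, so the average is taken over the three model tokens only.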

@sergiopaniego
Member Author

I've reviewed the rest of the scripts+notebooks using OpenEnv and they seem to be ok since they add the env feedback to the prompt.

I still need to upload the updated Wordle notebook (wip)

@qgallouedec
Member

@albertvillanova moved everything related to vLLM into a separate class in #4700, you'll have to deal with some conflicts 😬

@albertvillanova
Member

albertvillanova commented Jan 29, 2026

@albertvillanova moved everything related to vLLM into a separate class in #4700, you'll have to deal with some conflicts 😬

Thanks for the ping, @qgallouedec!

@sergiopaniego, since I am familiar with the changes introduced in #4700, I can take care of resolving the conflicts in this PR if that helps. 🤗

@sergiopaniego
Member Author

I'd really appreciate some help with that @albertvillanova!! 😄

@albertvillanova
Member

Done!

Member

@albertvillanova albertvillanova left a comment

Thanks! Just some comments below.

Comment on lines 457 to 531
report_to="trackio",
log_completions=True,
report_to="wandb",

Why the change from trackio to wandb?

return sampling_params


def _build_server_generation_kwargs(

This new _build_server_generation_kwargs function duplicates logic from _build_colocate_sampling_params. Maybe we could refactor these to reduce duplication.
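
One possible shape for such a refactor is sketched below. Everything in it is hypothetical: the `GenerationSettings` class, its fields, the `n` argument, and the helper names are placeholders chosen for illustration, not the actual signatures of `_build_server_generation_kwargs` or `_build_colocate_sampling_params`. The idea is only that the options both paths share are assembled in one place, and each path adds what is specific to it.

```python
from dataclasses import dataclass


@dataclass
class GenerationSettings:
    # Hypothetical stand-in for the trainer's generation options.
    temperature: float = 1.0
    top_p: float = 1.0
    max_tokens: int = 16


def _common_generation_kwargs(settings):
    """Single source of truth for options shared by both code paths."""
    return {
        "temperature": settings.temperature,
        "top_p": settings.top_p,
        "max_tokens": settings.max_tokens,
    }


def build_colocate_kwargs(settings):
    # Colocate path: the shared options would feed a sampling-params object.
    return _common_generation_kwargs(settings)


def build_server_kwargs(settings, n=1):
    # Server path: shared options plus a request-only extra (hypothetical `n`).
    kwargs = _common_generation_kwargs(settings)
    kwargs["n"] = n
    return kwargs
```

With this layout, adding or renaming a shared option touches only `_common_generation_kwargs`.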

Member

@albertvillanova albertvillanova left a comment

As privately discussed, please note that

  • Trainer no longer has a structured_outputs_regex attribute
  • Instead, it is an attribute of VLLMGeneration

So, in these code lines:

    if trainer.structured_outputs_regex:
        structured_outputs = StructuredOutputsParams(regex=trainer.structured_outputs_regex)

it may be better to use:

    if trainer.vllm_generation.structured_outputs_regex:
        structured_outputs = StructuredOutputsParams(regex=trainer.vllm_generation.structured_outputs_regex)

qgallouedec and others added 2 commits January 29, 2026 10:59
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Member

@qgallouedec qgallouedec left a comment

Everything related to GRPO and vLLM lgtm! nice work @sergiopaniego!

@sergiopaniego
Member Author

Results using trackio with the latest changes: https://huggingface.co/spaces/sergiopaniego/Wordle-GRPO

TODO: Upload the updated wordle notebook

@sergiopaniego sergiopaniego merged commit a03c2fc into main Feb 2, 2026
11 of 13 checks passed
@sergiopaniego sergiopaniego deleted the updated-wordle branch February 2, 2026 15:27
@cmunley1 cmunley1 mentioned this pull request Feb 6, 2026
5 tasks