Update openenv examples to use `environment_factory` by sergiopaniego · Pull Request #5235 · huggingface/trl

sergiopaniego · 2026-03-06T18:44:17Z

What does this PR do?

TODO:

Migrate notebooks
Update TRL-OpenEnv guide
Add multi-env example

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

Who can review?

Note

Low Risk
Documentation-only changes, but they significantly rewrite OpenEnv training guidance and example code; risk is mainly confusing users if the new environment_factory patterns or links are incorrect.

Overview
Migrates OpenEnv examples to environment_factory. The OpenEnv integration guide is heavily rewritten to center on GRPOTrainer(environment_factory=...), including updated install instructions, environment-class requirements, reward functions consuming the new environments argument, and guidance on multi-turn limits and server concurrency.

Reorganizes OpenEnv docs/examples. Adds OpenEnv to the main docs toctree under Integrations, restructures example_overview.md and examples/notebooks/README.md to introduce dedicated OpenEnv Notebooks/OpenEnv Scripts sections, and updates the Sudoku GRPO notebook to remove the custom rollout pipeline in favor of an environment_factory-based SudokuEnv plus environment-backed reward functions (and includes a multi-environment training example in the OpenEnv guide).

Written by Cursor Bugbot for commit fee1876. This will update automatically on new commits.

- catch.py: Format observations as readable text, normalize reward to 0-1, handle incomplete episodes
- echo.py: Rename step->echo and MyEchoEnv->EchoToolEnv, wrap in main()
- wordle.py: Normalize reward to 0-1, add RichProgressCallback
- sudoku.py: Fix cumulative message handling (diff-based), add board validation for move validity, add progress/hints/tried-moves to responses, add LoRA support, tune defaults for memory efficiency
- vllm_generation.py: Add </tool_call> stop token for tool calling loop
- grpo_trainer.py: Skip tool calls for environments that are done

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

sergiopaniego · 2026-03-13T14:46:58Z

PR: Update OpenEnv examples and docs to environment_factory (description generated by Claude)

Summary

Migrated all OpenEnv example scripts and notebooks from rollout_func to environment_factory: sudoku, catch, browsergym_llm (scripts) and wordle, sudoku (notebooks)
New multi_env.py example script demonstrating multi-environment GRPO training (Wordle + Catch in the same run) with per-environment reward functions returning None for non-applicable environments
Rewrote docs/source/openenv.md entirely around environment_factory: environment class pattern, tool discovery via introspection, Echo and Wordle tutorials, multi-environment section, server concurrency configuration, and a migration guide from rollout_func
Reorganized example_overview.md and examples/notebooks/README.md with dedicated OpenEnv subsections
Updated all OpenEnv documentation URLs to current paths (from_hub removed, .html extensions)
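The per-environment reward pattern from the summary can be sketched roughly as follows. This is an illustration only, not the code in multi_env.py: the `WordleEnv` and `CatchEnv` classes, their `reward` attribute, and the exact reward-function signature (`environments` plus `**kwargs`) are assumptions made for the example.

```python
class WordleEnv:
    """Hypothetical stand-in for a Wordle environment wrapper."""

    def __init__(self, reward=0.0):
        self.reward = reward


class CatchEnv:
    """Hypothetical stand-in for a Catch environment wrapper."""

    def __init__(self, reward=0.0):
        self.reward = reward


def wordle_reward(environments, **kwargs):
    # Score only Wordle samples; return None for any other environment type.
    return [env.reward if isinstance(env, WordleEnv) else None for env in environments]


def catch_reward(environments, **kwargs):
    # Score only Catch samples; return None for any other environment type.
    return [env.reward if isinstance(env, CatchEnv) else None for env in environments]


# A mixed batch, as in a run that trains on both environments at once.
batch = [WordleEnv(reward=1.0), CatchEnv(reward=0.5), WordleEnv(reward=0.0)]
wordle_scores = wordle_reward(batch)
catch_scores = catch_reward(batch)
```

Each function scores only its own environment type, so a batch drawn entirely from one environment leaves the other function returning `None` for every sample.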

Trainer fixes (`grpo_trainer.py`)

Fixed self.llm → self.vllm_generation.llm for colocate mode in the tool-calling loop
Added server mode branch for max_model_len (was raising NotImplementedError)
Added done-check (_done/done) to stop calling tools for finished environments
Added size alignment between tool_mask/logprobs/completion_ids after the tool loop (re-tokenization can cause off-by-one mismatches)
Added truncation to max_completion_length to prevent size mismatch errors in long multi-turn episodes
Fixed nan metrics crashing JSON serialization in loggers (convert to None). This happens in multi-environment training where per-environment reward functions return None for non-applicable samples, resulting in nan averages
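The size-alignment and truncation fixes listed above can be pictured with a small sketch. It is not the trainer's implementation: the helper name `align_and_truncate` is hypothetical, and trimming every sequence to the shortest one is an assumption about how an off-by-one mismatch might be resolved.

```python
def align_and_truncate(completion_ids, tool_mask, logprobs, max_completion_length):
    """Trim three per-token lists to one common length, capped at max_completion_length."""
    # Re-tokenization can leave the lists off by one, so the shortest list sets the length.
    n = min(len(completion_ids), len(tool_mask), len(logprobs), max_completion_length)
    return completion_ids[:n], tool_mask[:n], logprobs[:n]


ids, mask, lps = align_and_truncate(
    completion_ids=[11, 12, 13, 14, 15],
    tool_mask=[1, 1, 0, 0],  # one element short, as after a re-tokenization mismatch
    logprobs=[-0.1, -0.2, -0.3, -0.4, -0.5],
    max_completion_length=3,
)
```

After the call all three lists have length 3, so downstream tensor construction sees consistent shapes.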

vLLM fix (`vllm_generation.py`)

Added tools parameter to generate() so colocate mode stops generation at </tool_call>, allowing the tool-calling loop to execute tools between turns
Passed self.tools from the trainer to enable this
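To see why stopping at </tool_call> matters, here is a toy tool-calling loop. Everything in it is hypothetical (the `CounterEnv` class, `fake_generate`, and the `<tool_response>` wrapper); it only mimics the control flow described above, including the done-check that skips finished environments, and does not reproduce `_tool_call_loop`.

```python
import json

STOP = "</tool_call>"


class CounterEnv:
    """Hypothetical environment that finishes after two tool calls."""

    def __init__(self):
        self.calls = 0
        self.done = False

    def increment(self):
        self.calls += 1
        self.done = self.calls >= 2
        return f"count={self.calls}"


def fake_generate(prompt):
    # Stand-in for a model call that halts right after emitting the stop string.
    return '<tool_call>{"name": "increment", "arguments": {}}' + STOP


def tool_call_loop(env, prompt, max_turns=5):
    transcript = prompt
    for _ in range(max_turns):
        if env.done:  # do not call tools for a finished environment
            break
        completion = fake_generate(transcript)
        # Generation stopped at the tool call, so the payload can be parsed and executed now.
        payload = completion.removeprefix("<tool_call>").removesuffix(STOP)
        call = json.loads(payload)
        result = getattr(env, call["name"])(**call["arguments"])
        transcript += completion + f"\n<tool_response>{result}</tool_response>\n"
    return transcript


env = CounterEnv()
transcript = tool_call_loop(env, "Count to two.\n")
```

If generation did not stop at the closing tag, the model would keep writing past the tool call and the loop would have no point at which to insert the tool result.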

Not migrated

browsergym.py: requires multimodal tool responses (screenshots) not yet supported in _tool_call_loop. To migrate, _tool_call_loop needs to: (1) allow tool methods to return structured content ([{"type": "image", ...}]), (2) detect multimodal returns and build proper content blocks instead of str(), and (3) run image processing (pixel_values, image_grid_thw) per tool-calling iteration, not just once at generation start
nemo_gym: fundamentally incompatible, external agent servers own generation
browsergym notebook: blocked by Playwright/greenlet server issue and model compatibility (FunctionGemma, not Qwen3). To migrate: (1) upgrade openenv-core to async v0.2.2+ so Playwright runs in a dedicated thread pool instead of conflicting with FastAPI's event loop, and (2) investigate Qwen3 compatibility or adapt the notebook to work with FunctionGemma's tool-calling format
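Step (2) of the browsergym.py migration list, detecting structured returns instead of calling str(), might look something like the sketch below. The helper name `to_content_blocks` and the placeholder image payload are assumptions; image processing (step 3) is not covered.

```python
def to_content_blocks(result):
    """Turn a tool return value into a list of chat-style content blocks."""
    is_structured = isinstance(result, list) and all(
        isinstance(item, dict) and "type" in item for item in result
    )
    if is_structured and result:
        # Already structured (e.g. text plus image blocks): pass through unchanged.
        return result
    # Plain return value: fall back to a single text block.
    return [{"type": "text", "text": str(result)}]


text_blocks = to_content_blocks("Clicked button 42")
image_blocks = to_content_blocks([{"type": "image", "image": "<screenshot placeholder>"}])
```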

Not in this PR

Upgrade to openenv-core async v0.2.2+ (server + client .sync())
Update HF Space URLs to official openenv org (some scripts still point to personal spaces)

Required HF Space configuration

environment_factory creates N concurrent environment instances (one per generation), each opening a WebSocket to the server. By default, OpenEnv servers only allow 1 concurrent session, so the Spaces need to be configured for concurrency:

In server/environment.py, declare concurrent session support:

SUPPORTS_CONCURRENT_SESSIONS: bool = True

In server/app.py, set the concurrency limit:

app = create_app(
    create_my_environment,
    MyAction,
    MyObservation,
    max_concurrent_envs=64,  # must be >= generation_batch_size
)

max_concurrent_envs should be >= per_device_train_batch_size * gradient_accumulation_steps. For example, with gradient_accumulation_steps=64 and batch size 1, you need at least 64 concurrent sessions.

This applies to all Spaces used with environment_factory: echo, wordle, catch, sudoku, browsergym_llm, carla.
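As a quick sanity check before launching a run, the sizing rule above can be written out as arithmetic. The helper names are hypothetical; only the formula comes from the description.

```python
def required_concurrent_envs(per_device_train_batch_size, gradient_accumulation_steps):
    """Minimum number of concurrent sessions, per the rule stated above."""
    return per_device_train_batch_size * gradient_accumulation_steps


def check_server_capacity(max_concurrent_envs, per_device_train_batch_size, gradient_accumulation_steps):
    needed = required_concurrent_envs(per_device_train_batch_size, gradient_accumulation_steps)
    if max_concurrent_envs < needed:
        raise ValueError(f"Server allows {max_concurrent_envs} sessions but training needs {needed}")
    return needed


# The example from the text: batch size 1 with 64 accumulation steps.
needed = check_server_capacity(
    max_concurrent_envs=64,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,
)
```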

HuggingFaceDocBuilderDev · 2026-03-13T14:51:20Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

qgallouedec

Massive work @sergiopaniego!!

qgallouedec · 2026-03-21T02:56:13Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3c84ff46a8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-21T03:02:20Z

+        for key, val in self._metrics[mode].items():
+            avg = sum(val) / len(val)
+            # If a reward function returns None for all samples in a batch, its metric is NaN. Convert to None
+            # for clean serialization (e.g. JSON loggers crash on float NaN).
+            metrics[key] = None if math.isnan(avg) else avg


Update AsyncGRPOTrainer's log path too

/workspace/trl/AGENTS.md explicitly requires duplicated trainer logic to stay aligned, but this NaN-to-None logging fix only landed in GRPOTrainer/RLOOTrainer. The same log() block in trl/experimental/async_grpo/async_grpo_trainer.py:543-550 still averages all-None rewards to nan, so async GRPO runs with per-environment reward functions can still hand non-serializable metrics to JSON loggers. That leaves the exact crash this patch is fixing in one trainer variant and introduces the inconsistency the repo guidelines call out.
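A standalone sketch of the NaN-to-None conversion discussed here, using only the standard library. The metric names are made up, and `allow_nan=False` stands in for a strict JSON logger; which loggers actually reject NaN is not specified in this thread.

```python
import json
import math


def safe_average(values):
    """Average a metric list, mapping a NaN result to None."""
    avg = sum(values) / len(values)
    return None if math.isnan(avg) else avg


metrics = {
    "rewards/wordle": safe_average([1.0, 0.0]),
    # All samples were non-applicable, so the logged values are NaN.
    "rewards/catch": safe_average([float("nan"), float("nan")]),
}

# allow_nan=False mimics a strict JSON encoder that refuses NaN.
encoded = json.dumps(metrics, allow_nan=False)
```

Without the conversion, the same `json.dumps(..., allow_nan=False)` call raises `ValueError` on the NaN value.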

Yes @sergiopaniego Please address this

chatgpt-codex-connector · 2026-03-21T03:02:20Z

+            Returns:
+                The updated page observation.
+            """
+            return self._do_action(f"fill('{bid}', '{text}')")


Escape BrowserGym text before building action strings

When a task needs to type text containing an apostrophe, this produces an invalid BrowserGym command such as fill('42', 'O'Connor'). The old rollout let the model choose its own quoting, but the new wrapper always wraps tool arguments in single quotes, so a common class of form-filling tasks now fails before the environment receives the intended text. send_keys() a few lines below has the same regression.
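One possible way to address this, sketched under the assumption that BrowserGym action strings are parsed as Python call expressions, is to let `repr()` choose the quoting. The helper name `build_fill_action` is hypothetical and this is not necessarily the fix that was merged.

```python
import ast


def build_fill_action(bid, text):
    # repr() picks a quote style and escapes as needed, yielding valid Python string literals.
    return f"fill({bid!r}, {text!r})"


action = build_fill_action("42", "O'Connor")

# Sanity check: the action parses as one call expression and the text round-trips intact.
parsed = ast.parse(action, mode="eval")
recovered_text = ast.literal_eval(parsed.body.args[1])
```

The same approach would apply to send_keys(), which the review notes has the same problem.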

Good point I guess

chatgpt-codex-connector · 2026-03-21T03:02:20Z

+            Returns:
+                The observation after moving.
+            """
+            action_id = 0 if direction == "left" else 2


Validate Catch directions instead of defaulting to right

If the model emits any move() argument other than exactly "left"—for example a malformed tool call like direction="stay" or a capitalization mismatch—this line silently converts it into action 2 (move right). Because tool arguments are free-form strings, that turns invalid tool calls into seemingly valid trajectories and corrupts the reward signal instead of surfacing a tool error. The same mapping is duplicated in multi_env.py.
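A minimal sketch of explicit validation, assuming only the two action ids visible in the reviewed line (0 for left, 2 for right). The helper name is hypothetical, and whether the environment exposes other actions is not stated here.

```python
ACTION_IDS = {"left": 0, "right": 2}  # ids taken from the reviewed snippet


def direction_to_action(direction):
    """Map a direction string to an action id, rejecting anything unrecognized."""
    try:
        return ACTION_IDS[direction]
    except KeyError:
        valid = sorted(ACTION_IDS)
        raise ValueError(f"Invalid direction {direction!r}; expected one of {valid}") from None


left_id = direction_to_action("left")
right_id = direction_to_action("right")
```

Raising on unknown input surfaces a malformed tool call as an error rather than silently recording a move to the right.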

Good point as well

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

sergiopaniego and others added 10 commits March 4, 2026 10:12

Update openenv scripts to use environment_factory

2463c6b

Updated examples

a240c15

Merge branch 'main' of github.com:huggingface/trl into update-openenv…

6a93c55

…-examples

Merge branch 'main' into update-openenv-examples

3d3222c

Update

de73f92

Notebooks and multi env updated

286d4eb

Merge branch 'main' of github.com:huggingface/trl into update-openenv…

d7a7c9a

…-examples

OpenEnv guide updated

4e7cc85

Fixes

9704198

sergiopaniego marked this pull request as ready for review March 13, 2026 14:47

Trigger CI checks

17c1516

cursor Bot reviewed Mar 13, 2026

Comment thread examples/notebooks/openenv_sudoku_grpo.ipynb Outdated

Comment thread examples/notebooks/openenv_sudoku_grpo.ipynb Outdated

Hard -> easy

cadf6a6

qgallouedec reviewed Mar 13, 2026

Comment thread trl/generation/vllm_generation.py Outdated

cursor Bot reviewed Mar 13, 2026

Comment thread examples/notebooks/openenv_sudoku_grpo.ipynb Outdated

qgallouedec reviewed Mar 13, 2026

Comment thread trl/trainer/grpo_trainer.py Outdated

qgallouedec reviewed Mar 13, 2026

Comment thread trl/trainer/grpo_trainer.py Outdated

qgallouedec reviewed Mar 13, 2026

Comment thread trl/trainer/grpo_trainer.py Outdated

sergiopaniego added 4 commits March 18, 2026 15:52

Merge branch 'main' into update-openenv-examples

b8aca46

Merge branch 'main' of github.com:huggingface/trl into update-openenv…

53e5b64

…-examples

update based on review

81f4d0f

code quality

1aa50e0

cursor Bot reviewed Mar 19, 2026

Comment thread examples/notebooks/openenv_sudoku_grpo.ipynb

sergiopaniego added 2 commits March 19, 2026 15:14

Remove redundant file

eefe904

Updated

ed9b4e1

cursor Bot reviewed Mar 19, 2026

Comment thread examples/notebooks/openenv_sudoku_grpo.ipynb Outdated

Updated notebooks

b9a48ff

cursor Bot reviewed Mar 19, 2026

Comment thread examples/scripts/openenv/multi_env.py

sergiopaniego added 2 commits March 19, 2026 16:40

fix

b5d9ddf

Merge branch 'main' of github.com:huggingface/trl into update-openenv…

a0545b0

…-examples

qgallouedec mentioned this pull request Mar 19, 2026

Rollout generation #5299

Closed

AmineDiro mentioned this pull request Mar 20, 2026

(4/5) async grpo break out of generation loop (is_done) #5321

Open

sergiopaniego mentioned this pull request Mar 20, 2026

Support multimodal tool responses in environment_factory for VLM training #5323

Merged

5 tasks

sergiopaniego and others added 4 commits March 20, 2026 17:05

Merge branch 'main' into update-openenv-examples

8292a64

Merge branch 'main' into update-openenv-examples

1072c5f

move into integration

247f207

minor things

3c84ff4

qgallouedec approved these changes Mar 21, 2026

chatgpt-codex-connector Bot reviewed Mar 21, 2026

sergiopaniego added 2 commits March 23, 2026 11:22

Updates based on Codex review

d95adad

Merge branch 'main' of github.com:huggingface/trl into update-openenv…

16825e3

…-examples

cursor Bot reviewed Mar 23, 2026

Comment thread trl/trainer/grpo_trainer.py Outdated

Updates

a5944ee

cursor Bot reviewed Mar 23, 2026

Comment thread examples/notebooks/openenv_sudoku_grpo.ipynb

sergiopaniego added 2 commits March 23, 2026 12:28

Update

2e3b6fa

Update carla example

fee1876

sergiopaniego merged commit c0e3fb0 into main Mar 23, 2026
16 checks passed

sergiopaniego deleted the update-openenv-examples branch March 23, 2026 12:01

sergiopaniego mentioned this pull request Apr 21, 2026

docs(tutorial): migrate wordle-grpo from rollout_func to environment_… huggingface/OpenEnv#601

Merged

16 tasks
