Update openenv examples to use environment_factory#5235

Merged
sergiopaniego merged 30 commits into main from update-openenv-examples on Mar 23, 2026

@sergiopaniego sergiopaniego commented Mar 6, 2026

Member

What does this PR do?

TODO:

  • Migrate notebooks
  • Update TRL-OpenEnv guide
  • Add multi-env example

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?


Note

Low Risk
Documentation-only changes, but they significantly rewrite OpenEnv training guidance and example code; risk is mainly confusing users if the new environment_factory patterns or links are incorrect.

Overview
Migrates OpenEnv examples to environment_factory. The OpenEnv integration guide is heavily rewritten to center on GRPOTrainer(environment_factory=...), including updated install instructions, environment-class requirements, reward functions consuming the new environments argument, and guidance on multi-turn limits and server concurrency.

Reorganizes OpenEnv docs/examples. Adds OpenEnv to the main docs toctree under Integrations, restructures example_overview.md and examples/notebooks/README.md to introduce dedicated OpenEnv Notebooks/OpenEnv Scripts sections, and updates the Sudoku GRPO notebook to remove the custom rollout pipeline in favor of an environment_factory-based SudokuEnv plus environment-backed reward functions (and includes a multi-environment training example in the OpenEnv guide).

Written by Cursor Bugbot for commit fee1876. This will update automatically on new commits.

sergiopaniego and others added 10 commits March 4, 2026 10:12
- catch.py: Format observations as readable text, normalize reward to 0-1,
  handle incomplete episodes
- echo.py: Rename step->echo and MyEchoEnv->EchoToolEnv, wrap in main()
- wordle.py: Normalize reward to 0-1, add RichProgressCallback
- sudoku.py: Fix cumulative message handling (diff-based), add board
  validation for move validity, add progress/hints/tried-moves to responses,
  add LoRA support, tune defaults for memory efficiency
- vllm_generation.py: Add </tool_call> stop token for tool calling loop
- grpo_trainer.py: Skip tool calls for environments that are done

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sergiopaniego

Member Author

PR: Update OpenEnv examples and docs to environment_factory (generated by Claude)

Summary

  • Migrated all OpenEnv example scripts and notebooks from rollout_func to environment_factory: sudoku, catch, browsergym_llm (scripts) and wordle, sudoku (notebooks)
  • New multi_env.py example script demonstrating multi-environment GRPO training (Wordle + Catch in the same run) with per-environment reward functions returning None for non-applicable environments
  • Rewrote docs/source/openenv.md entirely around environment_factory: environment class pattern, tool discovery via introspection, Echo and Wordle tutorials, multi-environment section, server concurrency configuration, and a migration guide from rollout_func
  • Reorganized example_overview.md and examples/notebooks/README.md with dedicated OpenEnv subsections
  • Updated all OpenEnv documentation URLs to current paths (from_hub removed, .html extensions)
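
The per-environment reward pattern mentioned above (returning None for non-applicable environments) can be sketched as follows. This is a minimal illustration, not the code from multi_env.py: the WordleEnv and CatchEnv classes and their reward attribute are hypothetical stand-ins, and only the environments argument name comes from the PR description.

```python
from typing import Optional


class WordleEnv:
    """Hypothetical stand-in for a Wordle environment wrapper."""

    def __init__(self, reward: float = 0.0):
        self.reward = reward


class CatchEnv:
    """Hypothetical stand-in for a Catch environment wrapper."""

    def __init__(self, reward: float = 0.0):
        self.reward = reward


def wordle_reward(environments, **kwargs) -> list[Optional[float]]:
    # Score only Wordle samples; return None for samples from other environments.
    return [env.reward if isinstance(env, WordleEnv) else None for env in environments]


def catch_reward(environments, **kwargs) -> list[Optional[float]]:
    # Score only Catch samples; return None for samples from other environments.
    return [env.reward if isinstance(env, CatchEnv) else None for env in environments]


# One mixed batch: each reward function ignores the samples it does not own.
batch = [WordleEnv(1.0), CatchEnv(0.5), WordleEnv(0.0)]
wordle_scores = wordle_reward(batch)
catch_scores = catch_reward(batch)
```

Returning None rather than 0.0 keeps a non-applicable sample from being counted as a failure for that reward function.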

Trainer fixes (grpo_trainer.py)

  • Fixed self.llm → self.vllm_generation.llm for colocate mode in the tool-calling loop
  • Added server mode branch for max_model_len (was raising NotImplementedError)
  • Added done-check (_done/done) to stop calling tools for finished environments
  • Added size alignment between tool_mask/logprobs/completion_ids after the tool loop (re-tokenization can cause off-by-one mismatches)
  • Added truncation to max_completion_length to prevent size mismatch errors in long multi-turn episodes
  • Fixed nan metrics crashing JSON serialization in loggers (convert to None). This happens in multi-environment training where per-environment reward functions return None for non-applicable samples, resulting in nan averages
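
The size-alignment and truncation fixes listed above can be pictured with a small helper. This is only a sketch of the idea under the assumption that the three sequences are plain per-token lists; align_and_truncate is a hypothetical name, not a function in grpo_trainer.py.

```python
def align_and_truncate(completion_ids, tool_mask, logprobs, max_completion_length):
    """Trim three per-token sequences to a common length, capped at the maximum.

    Hypothetical helper: re-tokenization can leave the sequences off by one,
    so the shortest one decides the common length.
    """
    n = min(len(completion_ids), len(tool_mask), len(logprobs), max_completion_length)
    return completion_ids[:n], tool_mask[:n], logprobs[:n]


# The mask is one token short after a tool turn; everything is cut to 3 tokens.
ids, mask, lps = align_and_truncate(
    [1, 2, 3, 4, 5], [1, 1, 0, 0], [-0.1, -0.2, -0.3, -0.4, -0.5], max_completion_length=3
)
```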

vLLM fix (vllm_generation.py)

  • Added tools parameter to generate() so colocate mode stops generation at </tool_call>, allowing the tool-calling loop to execute tools between turns
  • Passed self.tools from the trainer to enable this
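
To illustrate why stopping at </tool_call> matters, here is a rough sketch of what a tool-calling loop has to do with a turn that ends there: pull out the last tool call so the tool can be executed before generation resumes. The JSON-inside-tags format and the guess tool are assumptions for illustration; parse_last_tool_call is a hypothetical helper, not part of TRL or vLLM.

```python
import json
from typing import Optional

OPEN_TAG, CLOSE_TAG = "<tool_call>", "</tool_call>"


def parse_last_tool_call(completion: str) -> Optional[dict]:
    """Return the last tool call in a completion as a dict, or None if absent or malformed."""
    start = completion.rfind(OPEN_TAG)
    if start == -1:
        return None
    payload = completion[start + len(OPEN_TAG):]
    # The stop string may or may not be included in the generated text.
    end = payload.find(CLOSE_TAG)
    if end != -1:
        payload = payload[:end]
    try:
        return json.loads(payload.strip())
    except json.JSONDecodeError:
        return None


turn = 'I will guess.\n<tool_call>\n{"name": "guess", "arguments": {"word": "crane"}}\n</tool_call>'
call = parse_last_tool_call(turn)
```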

Not migrated

  • browsergym.py: requires multimodal tool responses (screenshots) not yet supported in _tool_call_loop. To migrate, _tool_call_loop needs to: (1) allow tool methods to return structured content ([{"type": "image", ...}]), (2) detect multimodal returns and build proper content blocks instead of str(), and (3) run image processing (pixel_values, image_grid_thw) per tool-calling iteration, not just once at generation start
  • nemo_gym: fundamentally incompatible, external agent servers own generation
  • browsergym notebook: blocked by Playwright/greenlet server issue and model compatibility (FunctionGemma, not Qwen3). To migrate: (1) upgrade openenv-core to async v0.2.2+ so Playwright runs in a dedicated thread pool instead of conflicting with FastAPI's event loop, and (2) investigate Qwen3 compatibility or adapt the notebook to work with FunctionGemma's tool-calling format

Not in this PR

  • Upgrade to openenv-core async v0.2.2+ (server + client .sync())
  • Update HF Space URLs to official openenv org (some scripts still point to personal spaces)

Required HF Space configuration

environment_factory creates N concurrent environment instances (one per generation), each opening a WebSocket to the server. By default, OpenEnv servers only allow 1 concurrent session, so the Spaces need to be configured for concurrency:

  1. In server/environment.py, declare concurrent session support:
SUPPORTS_CONCURRENT_SESSIONS: bool = True
  2. In server/app.py, set the concurrency limit:
app = create_app(
    create_my_environment,
    MyAction,
    MyObservation,
    max_concurrent_envs=64,  # must be >= generation_batch_size
)

max_concurrent_envs should be >= per_device_train_batch_size * gradient_accumulation_steps. For example, with gradient_accumulation_steps=64 and batch size 1, you need at least 64 concurrent sessions.
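
The sizing rule above can be checked before launching a run. The helper names below are hypothetical, and the sketch assumes a single training process; it only encodes the inequality stated in this comment.

```python
def required_concurrent_envs(per_device_train_batch_size: int, gradient_accumulation_steps: int) -> int:
    """Minimum number of concurrent sessions, per the rule stated above."""
    return per_device_train_batch_size * gradient_accumulation_steps


def check_server_capacity(
    max_concurrent_envs: int, per_device_train_batch_size: int, gradient_accumulation_steps: int
) -> int:
    """Raise if the server limit is below what training needs; otherwise return the requirement."""
    needed = required_concurrent_envs(per_device_train_batch_size, gradient_accumulation_steps)
    if max_concurrent_envs < needed:
        raise ValueError(f"max_concurrent_envs={max_concurrent_envs} is below the {needed} sessions needed")
    return needed


# Batch size 1 with 64 accumulation steps needs at least 64 sessions.
needed = check_server_capacity(
    max_concurrent_envs=64, per_device_train_batch_size=1, gradient_accumulation_steps=64
)
```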

This applies to all Spaces used with environment_factory: echo, wordle, catch, sudoku, browsergym_llm, carla.

@sergiopaniego sergiopaniego marked this pull request as ready for review March 13, 2026 14:47
@HuggingFaceDocBuilderDev


The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment thread examples/notebooks/openenv_sudoku_grpo.ipynb Outdated
Comment thread examples/notebooks/openenv_sudoku_grpo.ipynb Outdated
Comment thread trl/generation/vllm_generation.py Outdated
Comment thread examples/notebooks/openenv_sudoku_grpo.ipynb Outdated
Comment thread trl/trainer/grpo_trainer.py Outdated
Comment thread trl/trainer/grpo_trainer.py Outdated
Comment thread trl/trainer/grpo_trainer.py Outdated
Comment thread examples/notebooks/openenv_sudoku_grpo.ipynb
Comment thread examples/notebooks/openenv_sudoku_grpo.ipynb Outdated
Comment thread examples/scripts/openenv/multi_env.py

@qgallouedec qgallouedec left a comment

Member

Massive work @sergiopaniego!!

@qgallouedec

Member

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3c84ff46a8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread trl/trainer/grpo_trainer.py Outdated
Comment on lines +2465 to +2469
for key, val in self._metrics[mode].items():
    avg = sum(val) / len(val)
    # If a reward function returns None for all samples in a batch, its metric is NaN. Convert to None
    # for clean serialization (e.g. JSON loggers crash on float NaN).
    metrics[key] = None if math.isnan(avg) else avg

P1: Update AsyncGRPOTrainer's log path too

/workspace/trl/AGENTS.md explicitly requires duplicated trainer logic to stay aligned, but this NaN-to-None logging fix only landed in GRPOTrainer/RLOOTrainer. The same log() block in trl/experimental/async_grpo/async_grpo_trainer.py:543-550 still averages all-None rewards to nan, so async GRPO runs with per-environment reward functions can still hand non-serializable metrics to JSON loggers. That leaves the exact crash this patch is fixing in one trainer variant and introduces the inconsistency the repo guidelines call out.

Member

Yes @sergiopaniego Please address this
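
For reference, the NaN-to-None conversion discussed in this thread can be reproduced outside the trainer. This is a standalone sketch: sanitize_metrics and the metric names are hypothetical, and strict JSON behaviour is simulated with allow_nan=False (the standard json module emits a non-standard NaN token by default instead of raising).

```python
import json
import math


def sanitize_metrics(raw: dict) -> dict:
    """Average each metric and replace NaN averages with None."""
    metrics = {}
    for key, values in raw.items():
        avg = sum(values) / len(values)
        metrics[key] = None if math.isnan(avg) else avg
    return metrics


# A reward function that returned None for every sample ends up with a NaN average.
raw = {"rewards/wordle/mean": [float("nan"), float("nan")], "loss": [0.5, 0.25]}
clean = sanitize_metrics(raw)

# Strict serialization succeeds because None becomes null.
payload = json.dumps(clean, allow_nan=False)
```

Any trainer variant that duplicates the averaging loop would need the same guard to avoid handing NaN to a strict JSON logger.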

Returns:
The updated page observation.
"""
return self._do_action(f"fill('{bid}', '{text}')")

P2: Escape BrowserGym text before building action strings

When a task needs to type text containing an apostrophe, this produces an invalid BrowserGym command such as fill('42', 'O'Connor'). The old rollout let the model choose its own quoting, but the new wrapper always wraps tool arguments in single quotes, so a common class of form-filling tasks now fails before the environment receives the intended text. send_keys() a few lines below has the same regression.

Member

Good point I guess
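
One possible way to address the quoting problem raised in this thread, assuming the BrowserGym side parses action strings as Python call expressions (an assumption, not confirmed here): let repr() choose the quotes and escape the text. build_fill_action and parse_action_args are hypothetical helpers, not the fix that was merged.

```python
import ast


def build_fill_action(bid: str, text: str) -> str:
    """Build a fill(...) action string with arguments quoted as Python literals."""
    # repr() switches quote style or escapes as needed, so apostrophes are safe.
    return f"fill({bid!r}, {text!r})"


def parse_action_args(action: str) -> list:
    """Parse the action back into its literal arguments (used here as a round-trip check)."""
    call = ast.parse(action, mode="eval").body
    return [ast.literal_eval(arg) for arg in call.args]


action = build_fill_action("42", "O'Connor")
args = parse_action_args(action)
```

The same quoting would apply to send_keys(), which the review notes has the same issue.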

Comment thread examples/scripts/openenv/catch.py Outdated
Returns:
The observation after moving.
"""
action_id = 0 if direction == "left" else 2

P2: Validate Catch directions instead of defaulting to right

If the model emits any move() argument other than exactly "left"—for example a malformed tool call like direction="stay" or a capitalization mismatch—this line silently converts it into action 2 (move right). Because tool arguments are free-form strings, that turns invalid tool calls into seemingly valid trajectories and corrupts the reward signal instead of surfacing a tool error. The same mapping is duplicated in multi_env.py.

Member

Good point as well
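
A minimal sketch of the validation suggested in this thread: map known directions explicitly and raise on anything else, rather than falling through to "right". The action ids 0 and 2 come from the snippet under review; direction_to_action_id is a hypothetical name, and whether the tool loop reports the exception back to the model as a tool error is an assumption.

```python
DIRECTION_TO_ACTION = {"left": 0, "right": 2}


def direction_to_action_id(direction: str) -> int:
    """Translate a direction string into a Catch action id, rejecting unknown values."""
    try:
        return DIRECTION_TO_ACTION[direction]
    except KeyError:
        valid = ", ".join(sorted(DIRECTION_TO_ACTION))
        raise ValueError(f"Invalid direction {direction!r}; expected one of: {valid}") from None


left_id = direction_to_action_id("left")
right_id = direction_to_action_id("right")
```

Keeping the mapping in one place would also help the duplicated logic in multi_env.py stay consistent.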

Comment thread trl/trainer/grpo_trainer.py Outdated

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread examples/notebooks/openenv_sudoku_grpo.ipynb
@sergiopaniego sergiopaniego merged commit c0e3fb0 into main Mar 23, 2026
16 checks passed
@sergiopaniego sergiopaniego deleted the update-openenv-examples branch March 23, 2026 12:01
3 participants