Update openenv examples to use environment_factory #5235
- catch.py: Format observations as readable text, normalize reward to 0-1, handle incomplete episodes
- echo.py: Rename step->echo and MyEchoEnv->EchoToolEnv, wrap in main()
- wordle.py: Normalize reward to 0-1, add RichProgressCallback
- sudoku.py: Fix cumulative message handling (diff-based), add board validation for move validity, add progress/hints/tried-moves to responses, add LoRA support, tune defaults for memory efficiency
- vllm_generation.py: Add </tool_call> stop token for tool calling loop
- grpo_trainer.py: Skip tool calls for environments that are done

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PR: Update OpenEnv examples and docs to Summary
Trainer fixes (
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
qgallouedec
left a comment
Massive work @sergiopaniego!!
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3c84ff46a8
    for key, val in self._metrics[mode].items():
        avg = sum(val) / len(val)
        # If a reward function returns None for all samples in a batch, its metric is NaN. Convert to None
        # for clean serialization (e.g. JSON loggers crash on float NaN).
        metrics[key] = None if math.isnan(avg) else avg
Update AsyncGRPOTrainer's log path too
/workspace/trl/AGENTS.md explicitly requires duplicated trainer logic to stay aligned, but this NaN-to-None logging fix only landed in GRPOTrainer/RLOOTrainer. The same log() block in trl/experimental/async_grpo/async_grpo_trainer.py:543-550 still averages all-None rewards to nan, so async GRPO runs with per-environment reward functions can still hand non-serializable metrics to JSON loggers. That leaves the exact crash this patch is fixing in one trainer variant and introduces the inconsistency the repo guidelines call out.
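A minimal standalone sketch of the NaN-to-None conversion discussed here, useful for checking the behavior outside any trainer. The helper name `average_metrics` and the metric keys are hypothetical, and `json.dumps(..., allow_nan=False)` stands in for a strict JSON logger; none of this is taken from the trainer code itself.

```python
import json
import math


def average_metrics(raw_metrics):
    """Average each metric's values, mapping a NaN average to None.

    `raw_metrics` maps a metric name to a non-empty list of floats
    (hypothetical shape, mirroring the snippet under review).
    """
    averaged = {}
    for key, values in raw_metrics.items():
        avg = sum(values) / len(values)
        # NaN is not valid JSON; None serializes as null instead.
        averaged[key] = None if math.isnan(avg) else avg
    return averaged


metrics = average_metrics(
    {
        "reward/catch": [0.5, 1.0],
        # A reward function that returned None for every sample ends up as NaN.
        "reward/wordle": [float("nan"), float("nan")],
    }
)

# allow_nan=False makes json.dumps raise ValueError on NaN, like a strict logger would.
payload = json.dumps(metrics, allow_nan=False)
print(payload)
```

Keeping the conversion in one shared helper is one way to keep duplicated `log()` blocks aligned across trainer variants, though whether that fits the repository's conventions is a judgment for the maintainers.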
        Returns:
            The updated page observation.
        """
        return self._do_action(f"fill('{bid}', '{text}')")
Escape BrowserGym text before building action strings
When a task needs to type text containing an apostrophe, this produces an invalid BrowserGym command such as fill('42', 'O'Connor'). The old rollout let the model choose its own quoting, but the new wrapper always wraps tool arguments in single quotes, so a common class of form-filling tasks now fails before the environment receives the intended text. send_keys() a few lines below has the same regression.
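One possible fix, sketched under the assumption that the action string is parsed with Python literal syntax: let `repr()` choose the quoting and escaping instead of hard-coding single quotes. The helper name `build_fill_action` is hypothetical, and the `ast` round trip below only checks that the string parses as intended; it does not exercise BrowserGym itself.

```python
import ast


def build_fill_action(bid: str, text: str) -> str:
    """Build a fill(...) action string with safely quoted arguments."""
    # !r applies repr(), which switches quote style or escapes as needed,
    # so apostrophes in `text` no longer terminate the literal early.
    return f"fill({bid!r}, {text!r})"


action = build_fill_action("42", "O'Connor")
print(action)

# Round-trip check: parse the action as a Python call and recover the arguments.
call = ast.parse(action, mode="eval").body
args = [ast.literal_eval(arg) for arg in call.args]
```

The same quoting approach would apply to the `send_keys()` wrapper mentioned above.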
        Returns:
            The observation after moving.
        """
        action_id = 0 if direction == "left" else 2
Validate Catch directions instead of defaulting to right
If the model emits any move() argument other than exactly "left"—for example a malformed tool call like direction="stay" or a capitalization mismatch—this line silently converts it into action 2 (move right). Because tool arguments are free-form strings, that turns invalid tool calls into seemingly valid trajectories and corrupts the reward signal instead of surfacing a tool error. The same mapping is duplicated in multi_env.py.
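A sketch of the stricter mapping suggested here, using only the two action ids visible in the snippet (0 for left, 2 for right). The helper name `direction_to_action` is hypothetical, and raising `ValueError` assumes the tool-calling loop reports exceptions back to the model as tool errors.

```python
# Only the ids shown in the reviewed code are mapped; anything else is rejected.
_DIRECTION_TO_ACTION = {"left": 0, "right": 2}


def direction_to_action(direction: str) -> int:
    """Translate a move() argument into an action id, rejecting unknown values."""
    if direction not in _DIRECTION_TO_ACTION:
        valid = ", ".join(sorted(_DIRECTION_TO_ACTION))
        raise ValueError(f"Invalid direction {direction!r}; expected one of: {valid}")
    return _DIRECTION_TO_ACTION[direction]


left_id = direction_to_action("left")
right_id = direction_to_action("right")

try:
    direction_to_action("stay")
    rejected = False
except ValueError:
    rejected = True
```

This version is deliberately case-sensitive, so "Left" is rejected too. Normalizing with `direction.lower()` before the lookup would be a more lenient alternative. Sharing one helper between catch.py and multi_env.py would also remove the duplicated mapping.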
Cursor Bugbot has reviewed your changes and found 1 potential issue.

What does this PR do?
Note
Low Risk
Documentation-only changes, but they significantly rewrite OpenEnv training guidance and example code; risk is mainly confusing users if the new environment_factory patterns or links are incorrect.
Overview
Migrates OpenEnv examples to environment_factory. The OpenEnv integration guide is heavily rewritten to center on GRPOTrainer(environment_factory=...), including updated install instructions, environment-class requirements, reward functions consuming the new environments argument, and guidance on multi-turn limits and server concurrency.
Reorganizes OpenEnv docs/examples. Adds OpenEnv to the main docs toctree under Integrations, restructures example_overview.md and examples/notebooks/README.md to introduce dedicated OpenEnv Notebooks/OpenEnv Scripts sections, and updates the Sudoku GRPO notebook to remove the custom rollout pipeline in favor of an environment_factory-based SudokuEnv plus environment-backed reward functions (and includes a multi-environment training example in the OpenEnv guide).
Written by Cursor Bugbot for commit fee1876. This will update automatically on new commits.
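To make the pattern in the overview concrete, here is a toy sketch that does not import TRL. It assumes an environment class exposes a `reset()` method and tool methods, and that reward functions receive the environment instances through an `environments` argument, as the summary describes. The class body, the `reward` and `done` attributes, the length-based scoring rule, and the manual driver loop are all illustrative assumptions, not the contents of echo.py.

```python
class EchoToolEnv:
    """Toy environment with one tool method and state a reward function can read."""

    def __init__(self):
        self.reward = 0.0
        self.done = False

    def reset(self, **kwargs):
        # Clear per-episode state before a new rollout (assumed contract).
        self.reward = 0.0
        self.done = False

    def echo(self, message: str) -> str:
        """Echo a message back.

        Args:
            message: Text to echo.

        Returns:
            The echoed text.
        """
        # Illustrative scoring rule: longer messages score higher, capped at 1.0.
        self.reward = min(len(message) / 10, 1.0)
        self.done = True
        return message


def reward_from_env(environments, **kwargs):
    """Read one reward per environment instance."""
    return [env.reward for env in environments]


# Stand-in for the trainer: build environments, reset them, call tools, score.
envs = [EchoToolEnv() for _ in range(2)]
for env in envs:
    env.reset()
envs[0].echo("hi")
envs[1].echo("hello world!")
rewards = reward_from_env(environments=envs)
print(rewards)
```

In actual training the class itself would be handed to `GRPOTrainer(environment_factory=...)` and the trainer would drive the tool calls. The exact requirements on the environment class are described in the rewritten OpenEnv guide rather than here.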