[trainer] feat: ReMax support using reward model for baseline by HollowMan6 · Pull Request #3780 · verl-project/verl

HollowMan6 · 2025-10-15T17:34:36Z

What does this PR do?

The ReMax reward baseline should not be limited to reward functions; this PR also supports using a reward model (rm) to calculate it.
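For context, ReMax uses the reward of a greedy-decoded response as the baseline that is subtracted from the reward of the sampled response. The sketch below illustrates that subtraction on plain Python lists. The function name `remax_advantages` and the list-based layout are assumptions made for illustration and are not verl's API; the point is only that `baseline_score` can be produced by a reward model just as well as by a rule-based reward function.

```python
def remax_advantages(token_rewards, response_mask, baseline_score):
    """Return per-token advantages for one sampled response.

    token_rewards:  per-token rewards of the sampled response
    response_mask:  1 for real response tokens, 0 for padding
    baseline_score: scalar reward of the greedy (baseline) response
    """
    advantages = [0.0] * len(token_rewards)
    running_return = 0.0
    # Walk backwards so each position holds the sum of rewards from that token onward.
    for t in reversed(range(len(token_rewards))):
        running_return += token_rewards[t] * response_mask[t]
        # Padding positions are masked out and contribute no advantage.
        advantages[t] = (running_return - baseline_score) * response_mask[t]
    return advantages


# Outcome reward of 1.0 on the last real token, one padding position at the end.
sampled_rewards = [0.0, 0.0, 1.0, 0.0]
mask = [1, 1, 1, 0]
advantages = remax_advantages(sampled_rewards, mask, baseline_score=0.5)
print(advantages)
```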

gemini-code-assist

Code Review

This pull request adds support for using a reward model to calculate the reward baseline in ReMax. The changes are applied to the base PPO trainer and several recipe-specific trainers. While the implementation in the base trainer is correct, the recipe trainers (dapo, entropy, prime, sppo) and the split_monkey_patch example have been updated inconsistently. Specifically, they don't use the compute_reward helper function, which could lead to breakages if the reward function's API changes. My review focuses on pointing out this critical inconsistency and suggesting a fix to make the code more robust and maintainable.

examples/split_placement/split_monkey_patch.py

recipe/dapo/dapo_ray_trainer.py

recipe/entropy/entropy_ray_trainer.py

recipe/prime/prime_ray_trainer.py

recipe/sppo/sppo_ray_trainer.py

gemini-code-assist

Code Review

This pull request adds support for using a reward model for baseline calculation in ReMax. The changes are applied across several trainer implementations.

My review focuses on two main points:

Code Duplication: The logic for calculating the ReMax baseline with a reward model is duplicated across multiple files (ray_trainer.py, dapo_ray_trainer.py, entropy_ray_trainer.py, sppo_ray_trainer.py, and split_monkey_patch.py). I've left comments suggesting refactoring this into a centralized helper method to improve maintainability.
Bug in prime_ray_trainer.py: I've identified a critical bug in the ReMax baseline calculation within prime_ray_trainer.py where the return value of compute_reward is handled incorrectly, and temporary data is not cleaned up.

Overall, the feature is a good addition, but the implementation would benefit from the suggested refactoring to address the code duplication and the bug fix.
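The refactoring suggested above could look roughly like the following sketch, in which the baseline logic lives in one method on a base class and recipe trainers inherit it. `BaseTrainer`, `RecipeTrainer`, `compute_remax_baseline`, and `fit_step` are hypothetical stand-ins and not verl classes, and summing the reward-function score with the reward-model score is an assumption chosen only to keep the example concrete.

```python
class BaseTrainer:
    """Stand-in for a base trainer; not verl's RayPPOTrainer."""

    def __init__(self, reward_fn, reward_model=None):
        self.reward_fn = reward_fn
        self.reward_model = reward_model

    def compute_remax_baseline(self, baseline_batch):
        """Score the greedy responses in one place for every subclass."""
        scores = self.reward_fn(baseline_batch)
        if self.reward_model is not None:
            rm_scores = self.reward_model(baseline_batch)
            # Assumption: the two score sources are combined by addition.
            scores = [s + r for s, r in zip(scores, rm_scores)]
        return scores


class RecipeTrainer(BaseTrainer):
    """A recipe-specific trainer reuses the helper instead of copying its body."""

    def fit_step(self, sampled_scores, baseline_batch):
        baseline = self.compute_remax_baseline(baseline_batch)
        return [s - b for s, b in zip(sampled_scores, baseline)]


trainer = RecipeTrainer(
    reward_fn=lambda batch: [1.0 for _ in batch["responses"]],
    reward_model=lambda batch: [0.5 for _ in batch["responses"]],
)
print(trainer.fit_step([2.0, 3.0], {"responses": ["a", "b"]}))
```

With this layout, a change to how the baseline is scored touches one method rather than five files.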

recipe/prime/prime_ray_trainer.py

examples/split_placement/split_monkey_patch.py

recipe/dapo/dapo_ray_trainer.py

recipe/entropy/entropy_ray_trainer.py

recipe/sppo/sppo_ray_trainer.py

verl/trainer/ppo/ray_trainer.py

gemini-code-assist

Code Review

This pull request adds support for using a reward model for baseline calculation in ReMax and refactors reward computation logic across several trainers. The changes are generally well-structured and reduce code duplication. However, I've identified a high-severity issue in recipe/prime/prime_ray_trainer.py where metrics from a reward model call might be lost due to incorrect handling after the refactoring. Please see the specific comment for details and a suggested fix.

recipe/prime/prime_ray_trainer.py

gemini-code-assist

Code Review

This pull request adds support for using a reward model to calculate the baseline for the ReMax advantage estimator across several trainer implementations. The changes are mostly consistent, but I've found a couple of critical issues. One is in an example file where a direct call to reward_fn is not robust and could lead to a crash. Another is in the prime_ray_trainer.py where an unsafe get call could result in a None value and cause an AttributeError. My review includes suggestions to fix these issues for improved robustness and correctness.
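The `get` concern can be illustrated with a small sketch. It assumes the reward model's output behaves like a dictionary and that the scores live under an `rm_scores` key; `extract_reward_tensor` is a hypothetical helper, not a function from the PR. Failing early with a descriptive error avoids a later `AttributeError` on `None`.

```python
def extract_reward_tensor(rm_output, key="rm_scores"):
    """Return the reward entry, or fail with a clear message if it is missing."""
    tensor = rm_output.get(key)
    if tensor is None:
        available = sorted(rm_output)
        raise KeyError(f"reward model output has no '{key}' entry; available keys: {available}")
    return tensor


scores = extract_reward_tensor({"rm_scores": [0.1, 0.9]})
print(scores)
```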

examples/split_placement/split_monkey_patch.py

recipe/prime/prime_ray_trainer.py

gemini-code-assist

Code Review

This pull request extends ReMax to support using a reward model for calculating the reward baseline. The changes are applied across several trainer implementations, including the main PPO trainer and specific recipes like DAPO, Entropy, and PRIME. The core logic involves computing reward model scores and using them in the baseline calculation.

The refactoring to a central compute_reward function is a good improvement. However, I've identified several high-severity issues:

In multiple files, there's a missing check to see if reward model scores (rm_scores) are already present in the batch before recomputing them. This could lead to redundant work and potential errors.
In the PRIME recipe, the new compute_reward method has a side effect of modifying its batch argument in-place, which is a risky pattern.
Also in the PRIME recipe, the logic for retrieving the reward tensor from the reward model's output is fragile and could be made more robust.

I've provided suggestions to address these issues to improve the code's correctness and maintainability.
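A minimal sketch of the missing `rm_scores` check, assuming the batch can be treated as a dictionary. The helper `with_rm_scores` and the fake reward model are hypothetical; the sketch also returns a new dictionary instead of writing into its argument, which sidesteps the in-place side effect mentioned in the review.

```python
def with_rm_scores(batch, reward_model):
    """Return a batch that carries rm_scores, calling the model only when needed."""
    if "rm_scores" in batch:
        return batch  # already scored, nothing to recompute
    scored = dict(batch)  # new top-level dict; the caller's batch is left untouched
    scored["rm_scores"] = reward_model(batch)
    return scored


calls = []


def fake_reward_model(batch):
    calls.append(1)  # record each invocation so redundant work is visible
    return [0.5 for _ in batch["responses"]]


batch = {"responses": ["a", "b"]}
first = with_rm_scores(batch, fake_reward_model)
second = with_rm_scores(first, fake_reward_model)
print(len(calls))
```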

examples/split_placement/split_monkey_patch.py

recipe/dapo/dapo_ray_trainer.py

recipe/entropy/entropy_ray_trainer.py

recipe/prime/prime_ray_trainer.py

recipe/sppo/sppo_ray_trainer.py

HollowMan6 · 2025-10-16T17:31:14Z

/gemini review

gemini-code-assist

Code Review

This pull request successfully adds support for using a reward model as a baseline for ReMax advantage estimation across various trainers. The changes are consistent and well-implemented. A notable improvement is the refactoring of reward computation logic into a shared compute_reward function, which reduces code duplication.

However, there are two main areas for improvement. First, the logic for calculating the ReMax baseline is duplicated across five different trainer files. This should be refactored into a helper method in the base RayPPOTrainer class to improve maintainability. Second, in prime_ray_trainer.py, the use of deepcopy on a potentially large DataProto object introduces a significant performance risk. A more efficient, shallow-copy approach is recommended.

Addressing these points will make the codebase more robust and performant.
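The difference between the two copy strategies can be shown with a plain dictionary standing in for the batch object; this is an illustration under that assumption and does not use `DataProto` itself. A shallow copy creates a new top-level container, so new keys do not leak back to the caller, while the large payloads are shared rather than duplicated.

```python
import copy

# Stand-in for a large batch: the nested list plays the role of a big tensor.
batch = {"responses": [[1, 2, 3], [4, 5, 6]]}

deep = copy.deepcopy(batch)  # duplicates every nested payload
shallow = dict(batch)        # new top-level dict, payloads are shared

# Adding a key to the shallow copy does not touch the original batch.
shallow["rm_scores"] = [0.7, 0.2]

print("rm_scores" in batch)
print(shallow["responses"] is batch["responses"])
print(deep["responses"] is batch["responses"])
```

The trade-off is that shared payloads must not be mutated in place, since such a change would be visible through both the copy and the original.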

recipe/prime/prime_ray_trainer.py

verl/trainer/ppo/ray_trainer.py

Not only limited to reward functions, we should also support using rm to calculate the reward baseline. Signed-off-by: Hollow Man <hollowman@opensuse.org>

HollowMan6 requested review from FightingZhen, PeterSH6, eric-haibin-lin, ji-huazhong, tongyx361, vermouth1992 and zhaochenyang20 as code owners October 15, 2025 17:34

gemini-code-assist bot reviewed Oct 15, 2025

recipe/prime/prime_ray_trainer.py (resolved)

gemini-code-assist bot reviewed Oct 15, 2025

examples/split_placement/split_monkey_patch.py (outdated, resolved)

recipe/prime/prime_ray_trainer.py (resolved)

gemini-code-assist bot reviewed Oct 15, 2025

gemini-code-assist bot reviewed Oct 16, 2025

recipe/prime/prime_ray_trainer.py (outdated, resolved)

verl/trainer/ppo/ray_trainer.py (resolved)

[trainer] feat: ReMax support using reward model for baseline

73afb15

wuxibin89 approved these changes Oct 17, 2025

wuxibin89 merged commit ae5d850 into verl-project:main Oct 17, 2025
69 of 70 checks passed

HollowMan6 deleted the remax-reward branch October 17, 2025 06:02


2 participants