[trainer] feat: make reward loop disrm default #4466
wuxibin89 merged 7 commits into verl-project:main
Conversation
Code Review
This pull request refactors the reward model handling to make the reward loop the default implementation, which is a significant architectural improvement. The changes include new example scripts, updates to default configurations, and the core logic modifications in the trainer and reward manager. My review identified a critical logic issue in the initialization of the RewardLoopManager for synchronous rollout scenarios, which could lead to a runtime error. A code suggestion is provided to address this. The rest of the changes appear to be correct and well-implemented.
---
I want to know whether reward_model.use_reward_loop will affect the reward results. |
---
My test results are consistent with yours, but I don't understand why the reward loop disrm results are better. Aren't the `compute_scores` calls all the same?
---
Yes, they are essentially the same, and the train-time metrics are quite similar. Variance on the test set can be higher than during training. Another possible explanation is that the legacy DisRM experiment above was run in bf16 precision, and perhaps the vLLM inference engine has some optimizations there. I didn't run the full training. Are you observing that reward loop disrm consistently performs better throughout the entire training? @lizipao
---
#3407 |
---
One more thing: in the experiments above, the precision of reward scores in the reward loop is fp32, while in legacy disrm it is bf16. I don't think this should have much impact. But either way, maybe it's time we finally embrace the server-mode reward model, given its higher performance :)
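To make the fp32-vs-bf16 point above concrete, here is a small stdlib-only sketch (not from the PR) that emulates bf16 by truncating a float32 to its top 16 bits, showing the kind of small score drift the two pipelines could exhibit:

```python
import struct

def to_bf16(x: float) -> float:
    """Emulate bf16 by keeping the top 16 bits of the float32 encoding
    (sign, 8-bit exponent, 7-bit mantissa) and zeroing the rest."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# A reward score stored in bf16 loses low mantissa bits:
score = 0.3141592
print(to_bf16(score))  # → 0.3125, a ~0.5% relative error
```

With only 7 mantissa bits, bf16 carries roughly 2–3 decimal digits of precision, so per-sample reward scores can differ at the third digit between the two setups even when the model weights are identical; this is consistent with the "essentially the same, small gap" observation above.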
---
I've looked at the current code. When using reward loop disrm, the reward is still computed in ray_trainer.py, right? I recall that in a multi-machine setup, this is quite inefficient because the reward is only calculated on the main node. However, if the reward is handled in agent_loop, it can be distributed to other machines, which could significantly improve efficiency, correct? |
The architecture design is shown in the figure above. In either mode, the reward loop launches multiple workers to handle incoming reward computation requests: (1) the workers are distributed across nodes; (2) incoming requests are chunked and dispatched to the workers.
---
OK, I understand. Thanks!
### What does this PR do?

- Make reward loop disrm the default; users can specify `reward_model.use_reward_loop=True` to enable and `False` to disable (`True` by default).
- Architecture design as follows:

<img width="910" height="631" alt="image" src="https://github.com/user-attachments/assets/767c7413-2b52-4759-99b1-a44c0bcf8989" />

I have also tested the precision and find the gap between legacy fsdp disrm and reward loop disrm acceptable, with results as follows (legacy disrm in orange, reward loop disrm in red):

<img width="784" height="634" alt="image" src="https://github.com/user-attachments/assets/68d60ec3-0ba1-4a2c-9f66-97e0b490c9da" />

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.
### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
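As a usage sketch for the flag this PR adds: only `reward_model.use_reward_loop` is taken from the PR itself; the entry point and the companion `reward_model.enable` override are illustrative assumptions about a typical verl launch, not part of this change.

```shell
# Hypothetical launch command; entry point and other overrides are assumptions.
python3 -m verl.trainer.main_ppo \
    reward_model.enable=True \
    reward_model.use_reward_loop=True  # True is the default; set False for legacy disrm
```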