[trainer] feat: make reward loop disrm default #4466

Merged
wuxibin89 merged 7 commits into verl-project:main from yyDing1:make_rewardloop_disrm_default
Dec 15, 2025

Conversation


yyDing1 commented on Dec 9, 2025

What does this PR do?

  • Make the reward loop the default DisRM implementation. Users can set `reward_model.use_reward_loop=True` to enable it or `False` to disable it (`True` by default).
  • The architecture design is as follows:
image

I have also tested the precision and found the gap between the legacy FSDP DisRM and the reward loop DisRM acceptable. Results are as follows (legacy DisRM in orange, reward loop DisRM in red):

image

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this
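A hedged launch-command sketch (assuming verl's Hydra-style CLI overrides via the `verl.trainer.main_ppo` entry point; all other required flags are omitted with `...`):

```shell
# enable the reward loop DisRM (the new default)
python3 -m verl.trainer.main_ppo reward_model.use_reward_loop=True ...

# fall back to the legacy DisRM path
python3 -m verl.trainer.main_ppo reward_model.use_reward_loop=False ...
```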

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the reward model handling to make the reward loop the default implementation, which is a significant architectural improvement. The changes include new example scripts, updates to default configurations, and the core logic modifications in the trainer and reward manager. My review identified a critical logic issue in the initialization of the RewardLoopManager for synchronous rollout scenarios, which could lead to a runtime error. A code suggestion is provided to address this. The rest of the changes appear to be correct and well-implemented.


lizipao commented Dec 10, 2025

I want to know whether `reward_model.use_reward_loop` will affect the reward results.


yyDing1 commented Dec 10, 2025

The comparison of reward results is shown above. I also observed the test-set accuracy as follows:
image
image

Reward loop DisRM results are shown in yellow.

yyDing1 mentioned this pull request on Dec 10, 2025

lizipao commented Dec 10, 2025

My test results are consistent with yours, but I don't understand why the reward loop DisRM results are better; aren't the `compute_score` functions all the same?


yyDing1 commented Dec 10, 2025

Yes, they are essentially the same, and the train-time metrics are quite similar. Test-set variance can be less stable than training variance. Another possible explanation is that the legacy DisRM experiment above was run in bf16 precision, and perhaps the vLLM inference engine has some optimizations for that.

I didn’t run the full training. Are you observing that reward loop disrm consistently performs better throughout the entire training? @lizipao


lizipao commented Dec 10, 2025

#3407
That might be the reason, but I'm not sure.


yyDing1 commented Dec 10, 2025

Do you mean that using math_verify during evaluation could cause higher performance when the evaluation is run asynchronously?
In my experiments above, I do not use math_verify.
For training, DisRM overrides the rule-based reward.
For testing, the experiments listed above use the default reward function, defined here:
https://github.com/volcengine/verl/blob/d66120d7705989f9fda71b1fb3b45cba250e68c4/verl/utils/reward_score/__init__.py#L44-L51

One more thing: in the experiments above, the precision of the reward scores in the reward loop is fp32, while in the legacy DisRM it is bf16. But I don't think this should have much impact.

In any case, maybe it's time we finally embrace the server-mode reward model, given the higher performance :)


lizipao commented Dec 10, 2025

I've looked at the current code. When using reward loop disrm, the reward is still computed in ray_trainer.py, right? I recall that in a multi-machine setup, this is quite inefficient because the reward is only calculated on the main node. However, if the reward is handled in agent_loop, it can be distributed to other machines, which could significantly improve efficiency, correct?


yyDing1 commented Dec 10, 2025

  • For colocate mode: a RewardLoopManager is launched with multiple RewardLoopWorkers, which are distributed across multiple machines (in the same manner as AgentLoopWorker). Users can set `reward_model.reward_workers` to specify the number of parallel workers.
  • For standalone mode: a reward loop worker is launched for each AgentLoopWorker.

The architecture design is shown in the figure above.

So in either mode, the reward loop launches multiple workers to handle incoming reward computation requests: (1) the workers are distributed across nodes; (2) incoming requests are chunked and dispatched to the workers.

yyDing1 force-pushed the make_rewardloop_disrm_default branch from 47eef1f to 9f0fbec on December 10, 2025 09:13

lizipao commented Dec 10, 2025

OK, I understand. Thanks!

wuxibin89 merged commit 7eb030e into verl-project:main on Dec 15, 2025
190 of 214 checks passed
yyDing1 deleted the make_rewardloop_disrm_default branch on December 16, 2025 17:01
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
sophiayyya pushed a commit to sophiayyya/verl that referenced this pull request Jan 25, 2026