[trainer] feat: make reward loop disrm default #4466
wuxibin89 merged 7 commits into verl-project:main
Conversation
Code Review
This pull request refactors the reward model handling to make the reward loop the default implementation, which is a significant architectural improvement. The changes include new example scripts, updates to default configurations, and the core logic modifications in the trainer and reward manager. My review identified a critical logic issue in the initialization of the RewardLoopManager for synchronous rollout scenarios, which could lead to a runtime error. A code suggestion is provided to address this. The rest of the changes appear to be correct and well-implemented.
---
I want to know whether reward_model.use_reward_loop will affect the reward results. |
---
My test results are consistent with yours, but I don't understand why the reward loop disrm results are better. Aren't the `compute_scores` calls all the same?
---
Yes, they are essentially the same, and the train-time metrics are quite similar. Variance on the test set can be higher than during training. Another possible explanation is that the legacy DisRM experiment above was run in bf16 precision, and perhaps the vLLM inference engine has some optimizations there. I didn't run the full training. Are you observing that reward loop disrm consistently performs better throughout the entire training? @lizipao
---
#3407 |
---
One more thing: in the experiments above, the precision of reward scores in the reward loop is fp32, while in legacy disrm it is bf16. I don't think this should have much impact. But either way, maybe it's time we finally embrace the server-mode reward model, given its higher performance :)
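To make the fp32-vs-bf16 point above concrete, here is a small stdlib-only sketch (not from the PR) that emulates bf16 by truncating a float32 to its top 16 bits, showing the kind of small score drift the two pipelines could exhibit:

```python
import struct

def to_bf16(x: float) -> float:
    """Emulate bf16 by keeping the top 16 bits of the float32 encoding
    (sign, 8-bit exponent, 7-bit mantissa) and zeroing the rest."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# A reward score stored in bf16 loses low mantissa bits:
score = 0.3141592
print(to_bf16(score))  # → 0.3125, a ~0.5% relative error
```

With only 7 mantissa bits, bf16 carries roughly 2–3 decimal digits of precision, so per-sample reward scores can differ at the third digit between the two setups even when the model weights are identical; this is consistent with the "essentially the same, small gap" observation above.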
---
I've looked at the current code. When using reward loop disrm, the reward is still computed in ray_trainer.py, right? I recall that in a multi-machine setup, this is quite inefficient because the reward is only calculated on the main node. However, if the reward is handled in agent_loop, it can be distributed to other machines, which could significantly improve efficiency, correct? |
The architecture design is shown in the figure above. In either mode, the reward loop launches multiple workers to handle incoming reward computation requests: (1) the workers are distributed across nodes; (2) incoming requests are chunked and dispatched to the workers.
---
OK, I understand. Thanks!
### What does this PR do?

- Make reward loop disrm the default; users can specify `reward_model.use_reward_loop=True` to enable and `False` to disable (`True` by default).
- Architecture design as follows:

<img width="910" height="631" alt="image" src="https://github.com/user-attachments/assets/767c7413-2b52-4759-99b1-a44c0bcf8989" />

I have also tested the precision and find the gap between legacy fsdp disrm and reward loop disrm acceptable, with results as follows (legacy disrm in orange, reward loop disrm in red):

<img width="784" height="634" alt="image" src="https://github.com/user-attachments/assets/68d60ec3-0ba1-4a2c-9f66-97e0b490c9da" />

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.
### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
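As a usage sketch for the flag this PR adds: only `reward_model.use_reward_loop` is taken from the PR itself; the entry point and the companion `reward_model.enable` override are illustrative assumptions about a typical verl launch, not part of this change.

```shell
# Hypothetical launch command; entry point and other overrides are assumptions.
python3 -m verl.trainer.main_ppo \
    reward_model.enable=True \
    reward_model.use_reward_loop=True  # True is the default; set False for legacy disrm
```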