feat: Enable simulated user for multi-turn GRPO#606
feat: Enable simulated user for multi-turn GRPO#606jialei777 wants to merge 61 commits intoNVIDIA-NeMo:mainfrom
Conversation
|
Hi @jialei777 . This is a really cool example! This flew under the radar, but did you want to contribute this back into NeMo RL? Your example is very succinct and it would probably be a great starting example for other folks interested in multi-turn. Let us know if you're interested |
|
@terrykong definitely interested in contributing. The current issue is the convocation format is incorrect during "policy_generation.generate()". I guess you guys are more familiar with that? Would you help with that? What is the best way to discuss? |
|
Awesome @jialei777 I created an issue where we can discuss so we don't crowd your PR. Feel free to share what issues you seem to be running into |
|
@terrykong is that possible to have a meeting to discuss in detail? May be easier. |
7e0a66e to
317eb60
Compare
|
hey @terrykong, I think the PR is mostly ready. But unfortunately i did not sign-off my initial commits, does that really matter for squash merge? |
e83cfbd to
1c96d74
Compare
terrykong
left a comment
There was a problem hiding this comment.
Thanks @jialei777 for the contribution. Could you share an example run command and the train/reward curves for this in the PR description so others can try your example?
re: DCO
unfortunately the DCO check can't run on the main branch, do you mind resolving on your branch?
The DCO job gives these instructinos:
Ensure you have a local copy of your branch by checking out the pull request locally via command line.
In your local branch, run: git rebase HEAD~9 --signoff
Force push your changes to overwrite the branch: git push --force-with-lease origin jialei/simulated-user
Could you also add some tests to the code you added? You made some of the scripts executable, but those checks would be great to have in unit tests. Also for the environment, can you check out
and write similar tests| ).input_ids[0] | ||
|
|
||
| # Tokenize the raw content from the environment into chat format if needed | ||
| env_role = env_output.observations[i]["role"].lower() |
There was a problem hiding this comment.
@SahilJain314 can you review this logic?
also, I think whatever change happens here should be copied in run_multi_turn_rollout_async so they have feature parity
There was a problem hiding this comment.
already updated in run_sample_multi_turn_rollout, which is used in run_multi_turn_rollout_async. Please let me know if the logic is correct.
| "matplotlib", | ||
| "plotly", | ||
| "mlflow", | ||
| "google-adk", |
There was a problem hiding this comment.
Out of curiosity, how much larger does the google-adk make our images (when you include all the dependencies)?
Note that any change in pyproject.toml requires you to run uv lock to update the dependency lock file (that lock file is like a requirements.txt that pins the last known working state). A quick way to check is just building the docker image before and after and checking total image || layer size to see the delta.
At a glance, the direct dependencies of google adk are numerous so I think it would be better to add it as an "extra" under optional-dependencies like https://github.com/NVIDIA-NeMo/RL/blob/main/pyproject.toml#L51
and then adding a PY_EXECUTABLE for it and reference it in the environment:
RL/nemo_rl/distributed/virtual_cluster.py
Line 49 in a607b41
The benefit of slimming down our core dependencies when possible makes it easier/faster for certain CI jobs to run.
There was a problem hiding this comment.
do i also need to checkin the uv lock file?
|
@terrykong thank you for your review. I will try to address those issue shortly. I ran into a new issue after sync with main branch. Seems now "num_valid_samples" is bigger than batch_size (512). is that as expected, or it is a bug included recently? Here is a comparison of jobs before (red) and after (green) merge main
|
Signed-off-by: Parth Chadha <pchadha@nvidia.com> Signed-off-by: Jialei Chen <jialeic@google.com>
) Signed-off-by: Xuehan Xiong <xxman@google.com> Signed-off-by: Xuehan <xxman@google.com> Signed-off-by: Xuehan Xiong xxman@google.com Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Parth Chadha <pchadha@nvidia.com> Signed-off-by: KiddoZhu <zhaochengz@nvidia.com> Signed-off-by: Sahil Jain <48468750+SahilJain314@users.noreply.github.com> Signed-off-by: Luis Vega <vegaluisjose@users.noreply.github.com> Signed-off-by: Terry Kong <terrycurtiskong@gmail.com> Signed-off-by: Luis Vega <2478335+vegaluisjose@users.noreply.github.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> Signed-off-by: ashors1 <ashors@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Anna Shors <ashors@nvidia.com> Signed-off-by: Dheeraj Peri <peri.dheeraj@gmail.com> Signed-off-by: Sahil Jain <sahilj@nvidia.com> Signed-off-by: Yuki Huang <scsse_hqh2016@126.com> Signed-off-by: Shun Kiyono <shun.kiyono@sbintuitions.co.jp> Signed-off-by: Wei Du <wedu@nvidia.com> Signed-off-by: Charlie Truong <chtruong@nvidia.com> Signed-off-by: Atsunori Fujita <afujita@nvidia.com> Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com> Co-authored-by: Parth Chadha <pchadha@nvidia.com> Co-authored-by: Zhaocheng Zhu <zhaochengz@nvidia.com> Co-authored-by: Sahil Jain <48468750+SahilJain314@users.noreply.github.com> Co-authored-by: Luis Vega <vegaluisjose@users.noreply.github.com> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com> Co-authored-by: Luis Vega <2478335+vegaluisjose@users.noreply.github.com> Co-authored-by: Yi-Fu Wu <yifu.wu@gmail.com> Co-authored-by: Anna Shors <ashors@nvidia.com> Co-authored-by: Dheeraj Peri <peri.dheeraj@gmail.com> Co-authored-by: Yuki Huang <scsse_hqh2016@126.com> Co-authored-by: Shun Kiyono <shunk52@gmail.com> Co-authored-by: Wei Du <wedu@nvidia.com> Co-authored-by: Charlie Truong <chtruong@nvidia.com> Co-authored-by: atfujita <40932835+AtsunoriFujita@users.noreply.github.com> Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Parth Chadha <pchadha@nvidia.com> Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> Signed-off-by: Jimmy Zhang <133159885+jiemingz@users.noreply.github.com> Co-authored-by: Parth Chadha <pchadha@nvidia.com> Signed-off-by: Jialei Chen <jialeic@google.com>
…o#505) Signed-off-by: abukharin-nv <abukharin@nvidia.com> Co-authored-by: Sahil Jain <48468750+SahilJain314@users.noreply.github.com> Co-authored-by: Parth Chadha <pchadha@nvidia.com> Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Jonas yang <joyang@nvidia.com> Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Jialei Chen <jialeic@google.com>
…A-NeMo#589) Signed-off-by: Guyue Huang <guyueh@nvidia.com> Co-authored-by: Parth Chadha <pchadha@nvidia.com> Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com> Signed-off-by: Jialei Chen <jialeic@google.com>
…o#622) Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Sahil Jain <sahilj@nvidia.com> Co-authored-by: Sahil Jain <sahilj@nvidia.com> Signed-off-by: Jialei Chen <jialeic@google.com>
…ce (NVIDIA-NeMo#634) Signed-off-by: Zhiyu Li <zhiyul@nvidia.com> Signed-off-by: Yuki Huang <yukih@nvidia.com> Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com> Signed-off-by: Jialei Chen <jialeic@google.com>
…-NeMo#660) Signed-off-by: jubick1337 <mattyson.so@gmail.com> Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Terry Kong <terrycurtiskong@gmail.com> Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Jialei Chen <jialeic@google.com>
…Mo#638) Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Xuehan <xxman@google.com> Signed-off-by: Jialei Chen <jialeic@google.com>
…VIDIA-NeMo#681) Signed-off-by: ashors1 <ashors@nvidia.com> Signed-off-by: Anna Shors <ashors@nvidia.com> Co-authored-by: Parth Chadha <pchadha@nvidia.com> Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: ashors1 <ashors@nvidia.com> Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: ashors1 <ashors@nvidia.com> Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: ashors1 <ashors@nvidia.com> Signed-off-by: Jialei Chen <jialeic@google.com>
…VIDIA-NeMo#693) Signed-off-by: ashors1 <ashors@nvidia.com> Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> Signed-off-by: Jialei Chen <jialeic@google.com>
…DIA-NeMo#689) Signed-off-by: Yuki Huang <yukih@nvidia.com> Co-authored-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com> Signed-off-by: Terry Kong <terryk@nvidia.com> Co-authored-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> Signed-off-by: Jialei Chen <jialeic@google.com>
…tensor (NVIDIA-NeMo#704) Signed-off-by: Sahil Jain <sahilj@nvidia.com> Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Jonas yang <joyang@nvidia.com> Signed-off-by: Yuki Huang <yukih@nvidia.com> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com> Co-authored-by: Jonas Yang CN <joyang@nvidia.com> Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com> Co-authored-by: Ahmad Kiswani <kiswani.ahmad@gmail.com> Co-authored-by: Terry Kong <terryk@nvidia.com> Co-authored-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: ruit <ruit@nvidia.com> Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
…Mo#695) Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
3af1a14 to
1b972ce
Compare
Signed-off-by: Jialei Chen <jialeic@google.com>
|
Closing this PR and creating #732 since I messed up history with rebasing... |

What does this PR do ?
Add an simple example on multi-turn GRPO using ADK.
Issues
List issues that this PR closes (syntax):
Usage
Training reward:


Validation acc:
Before your PR is "Ready for review"
Pre checks:
Additional Information