Skip to content

feat: Enable simulated user for multi-turn GRPO#606

Closed
jialei777 wants to merge 61 commits intoNVIDIA-NeMo:mainfrom
jialei777:jialei/simulated-user
Closed

feat: Enable simulated user for multi-turn GRPO#606
jialei777 wants to merge 61 commits intoNVIDIA-NeMo:mainfrom
jialei777:jialei/simulated-user

Conversation

@jialei777
Copy link

@jialei777 jialei777 commented Jul 3, 2025

What does this PR do ?

Add an simple example on multi-turn GRPO using ADK.

Issues

List issues that this PR closes (syntax):

Usage

uv run python examples/run_grpo_unique_numbers_w_adk.py

Training reward:
image
Validation acc:
image

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@jialei777 jialei777 closed this Jul 3, 2025
@terrykong
Copy link
Collaborator

Hi @jialei777 . This is a really cool example! This flew under the radar, but did you want to contribute this back into NeMo RL?

Your example is very succinct and it would probably be a great starting example for other folks interested in multi-turn. Let us know if you're interested

@terrykong terrykong reopened this Jul 16, 2025
@jialei777
Copy link
Author

@terrykong definitely interested in contributing. The current issue is the convocation format is incorrect during "policy_generation.generate()". I guess you guys are more familiar with that? Would you help with that? What is the best way to discuss?

@terrykong
Copy link
Collaborator

Awesome

@jialei777 I created an issue where we can discuss so we don't crowd your PR. Feel free to share what issues you seem to be running into

@jialei777
Copy link
Author

@terrykong is that possible to have a meeting to discuss in detail? May be easier.

@jialei777 jialei777 force-pushed the jialei/simulated-user branch from 7e0a66e to 317eb60 Compare July 17, 2025 00:22
@jialei777 jialei777 changed the title Enable simulated user for multi-turn GRPO feat: Enable simulated user for multi-turn GRPO Jul 22, 2025
@jialei777
Copy link
Author

jialei777 commented Jul 22, 2025

hey @terrykong, I think the PR is mostly ready. But unfortunately i did not sign-off my initial commits, does that really matter for squash merge?

@jialei777 jialei777 force-pushed the jialei/simulated-user branch from e83cfbd to 1c96d74 Compare July 23, 2025 04:19
@jialei777 jialei777 marked this pull request as ready for review July 23, 2025 04:50
Copy link
Collaborator

@terrykong terrykong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jialei777 for the contribution. Could you share an example run command and the train/reward curves for this in the PR description so others can try your example?

re: DCO

unfortunately the DCO check can't run on the main branch, do you mind resolving on your branch?

The DCO job gives these instructinos:

Ensure you have a local copy of your branch by checking out the pull request locally via command line.
In your local branch, run: git rebase HEAD~9 --signoff
Force push your changes to overwrite the branch: git push --force-with-lease origin jialei/simulated-user


Could you also add some tests to the code you added? You made some of the scripts executable, but those checks would be great to have in unit tests. Also for the environment, can you check out

def test_math_env_step_basic(math_env, basic_test_data):
and write similar tests

).input_ids[0]

# Tokenize the raw content from the environment into chat format if needed
env_role = env_output.observations[i]["role"].lower()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SahilJain314 can you review this logic?

also, I think whatever change happens here should be copied in run_multi_turn_rollout_async so they have feature parity

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

already updated in run_sample_multi_turn_rollout, which is used in run_multi_turn_rollout_async. Please let me know if the logic is correct.

"matplotlib",
"plotly",
"mlflow",
"google-adk",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, how much larger does the google-adk make our images (when you include all the dependencies)?

Note that any change in pyproject.toml requires you to run uv lock to update the dependency lock file (that lock file is like a requirements.txt that pins the last known working state). A quick way to check is just building the docker image before and after and checking total image || layer size to see the delta.

At a glance, the direct dependencies of google adk are numerous so I think it would be better to add it as an "extra" under optional-dependencies like https://github.com/NVIDIA-NeMo/RL/blob/main/pyproject.toml#L51

and then adding a PY_EXECUTABLE for it and reference it in the environment:

VLLM = "uv run --locked --extra vllm"

The benefit of slimming down our core dependencies when possible makes it easier/faster for certain CI jobs to run.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do i also need to checkin the uv lock file?

@terrykong terrykong requested a review from SahilJain314 July 23, 2025 18:17
@jialei777
Copy link
Author

@terrykong thank you for your review. I will try to address those issue shortly.

I ran into a new issue after sync with main branch. Seems now "num_valid_samples" is bigger than batch_size (512). is that as expected, or it is a bug included recently? Here is a comparison of jobs before (red) and after (green) merge main

image

xxman-google and others added 12 commits July 23, 2025 21:02
Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
)

Signed-off-by: Xuehan Xiong <xxman@google.com>
Signed-off-by: Xuehan <xxman@google.com>
Signed-off-by: Xuehan Xiong xxman@google.com
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
Signed-off-by: KiddoZhu <zhaochengz@nvidia.com>
Signed-off-by: Sahil Jain <48468750+SahilJain314@users.noreply.github.com>
Signed-off-by: Luis Vega <vegaluisjose@users.noreply.github.com>
Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
Signed-off-by: Luis Vega <2478335+vegaluisjose@users.noreply.github.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Anna Shors <ashors@nvidia.com>
Signed-off-by: Dheeraj Peri <peri.dheeraj@gmail.com>
Signed-off-by: Sahil Jain <sahilj@nvidia.com>
Signed-off-by: Yuki Huang <scsse_hqh2016@126.com>
Signed-off-by: Shun Kiyono <shun.kiyono@sbintuitions.co.jp>
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Atsunori Fujita <afujita@nvidia.com>
Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com>
Co-authored-by: Parth Chadha <pchadha@nvidia.com>
Co-authored-by: Zhaocheng Zhu <zhaochengz@nvidia.com>
Co-authored-by: Sahil Jain <48468750+SahilJain314@users.noreply.github.com>
Co-authored-by: Luis Vega <vegaluisjose@users.noreply.github.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: Luis Vega <2478335+vegaluisjose@users.noreply.github.com>
Co-authored-by: Yi-Fu Wu <yifu.wu@gmail.com>
Co-authored-by: Anna Shors <ashors@nvidia.com>
Co-authored-by: Dheeraj Peri <peri.dheeraj@gmail.com>
Co-authored-by: Yuki Huang <scsse_hqh2016@126.com>
Co-authored-by: Shun Kiyono <shunk52@gmail.com>
Co-authored-by: Wei Du <wedu@nvidia.com>
Co-authored-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: atfujita <40932835+AtsunoriFujita@users.noreply.github.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>
Signed-off-by: Jimmy Zhang <133159885+jiemingz@users.noreply.github.com>
Co-authored-by: Parth Chadha <pchadha@nvidia.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
…o#505)

Signed-off-by: abukharin-nv <abukharin@nvidia.com>
Co-authored-by: Sahil Jain <48468750+SahilJain314@users.noreply.github.com>
Co-authored-by: Parth Chadha <pchadha@nvidia.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Jonas yang <joyang@nvidia.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
…A-NeMo#589)

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Co-authored-by: Parth Chadha <pchadha@nvidia.com>
Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
…o#622)

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Sahil Jain <sahilj@nvidia.com>
Co-authored-by: Sahil Jain <sahilj@nvidia.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
ZhiyuLi-Nvidia and others added 20 commits July 23, 2025 21:02
…ce (NVIDIA-NeMo#634)

Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
…-NeMo#660)

Signed-off-by: jubick1337 <mattyson.so@gmail.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
…Mo#638)

Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Xuehan <xxman@google.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
…VIDIA-NeMo#681)

Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Anna Shors <ashors@nvidia.com>
Co-authored-by: Parth Chadha <pchadha@nvidia.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
…VIDIA-NeMo#693)

Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
…DIA-NeMo#689)

Signed-off-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
…tensor (NVIDIA-NeMo#704)

Signed-off-by: Sahil Jain <sahilj@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Jonas yang <joyang@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: Jonas Yang CN <joyang@nvidia.com>
Co-authored-by: yuki <48991475+yuki-666@users.noreply.github.com>
Co-authored-by: Ahmad Kiswani <kiswani.ahmad@gmail.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
…Mo#695)

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
@jialei777 jialei777 force-pushed the jialei/simulated-user branch from 3af1a14 to 1b972ce Compare July 23, 2025 21:04
@github-actions github-actions bot added documentation Improvements or additions to documentation CI Relating to CI labels Jul 23, 2025
Signed-off-by: Jialei Chen <jialeic@google.com>
@github-actions github-actions bot removed the documentation Improvements or additions to documentation label Jul 23, 2025
@github-actions github-actions bot removed the CI Relating to CI label Jul 23, 2025
@jialei777
Copy link
Author

Closing this PR and creating #732 since I messed up history with rebasing...

@jialei777 jialei777 closed this Jul 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.