feat: Enable simulated user for multi-turn GRPO [new] #732

jialei777 wants to merge 9 commits into NVIDIA-NeMo:main
Conversation
```python
).input_ids[0]

# Tokenize the raw content from the environment into chat format if needed
env_role = env_output.observations[i]["role"].lower()
```
@SahilJain314 would you help take a look at the rollout logic? Here is more discussion on why I added this logic for the multi-turn case: #682 (comment)
The "<|begin_of_text|>" here seems highly specific to a particular model/tokenizer etc. Is there a way to make this general? Further, we currently handle base model training as a special case of chat-model training where the chat template is just concatenation. Does the .removeprefix here handle this case?
What do you suggest? I cannot find a good way to handle multi-turn conversations -- removing "<|begin_of_text|>" is a hacky way to make sure the conversation history is correct. Please check out the discussion here: #682 (comment)
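For context, a minimal sketch of the hack under discussion, assuming a Hugging Face Llama-3 tokenizer; the message content and variable names are illustrative, not taken from the PR:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
new_messages = [{"role": "user", "content": "What is 2 + 2?"}]

# Render only the newest turn, then strip the template's BOS string so the
# resulting ids can be appended to the already-tokenized conversation history.
text = tokenizer.apply_chat_template(new_messages, tokenize=False)
text = text.removeprefix("<|begin_of_text|>")  # Llama-3-specific BOS string
turn_ids = tokenizer(text, add_special_tokens=False).input_ids
```

As the comment above notes, the hard-coded prefix string is exactly what makes this model-specific.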
@SahilJain314 updated with a custom chat template for multi-turn conversation. I think it is a super-hacky solution, but at least it works for both single-turn and multi-turn. Please let me know if you have better ideas.
Why wouldn't this include the BOS token?
I updated the chat template so it does not add a BOS by default; the BOS is added with the first message instead. This way, I won't break the single-turn logic while also making sure the concatenated tokens for multi-turn are correct.
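For reference, a sketch of what such a template might look like, reusing the tokenizer from the sketch above and assuming Llama-3-style control tokens; the template actually added in the PR may differ:

```python
# Illustrative Llama-3-style template with no BOS: rendering each turn
# separately and concatenating the token ids then matches rendering the
# full history at once; the BOS is added once with the first message.
NO_BOS_CHAT_TEMPLATE = (
    "{%- for message in messages -%}"
    "<|start_header_id|>{{ message['role'] }}<|end_header_id|>\n\n"
    "{{ message['content'] }}<|eot_id|>"
    "{%- endfor -%}"
)
tokenizer.chat_template = NO_BOS_CHAT_TEMPLATE
```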
@SahilJain314 & @terrykong would you take another look and let me know if it is good to merge?

@SahilJain314 & @terrykong -- gentle ping

@terrykong according to our discussion offline, I feel this PR is good and we just need @SahilJain314 to sign off?

Yeah, I'd like @SahilJain314 to bless this change for the multi-turn rollout.
@jialei777 As such, this is what we've got:

```python
# do only in multiturn
x = apply_chat_template(string, add_special_tokens=False)
```

Here, we'd apply the chat_template as per the tokenizer's default, but then tokenize the BOS string (or find the BOS token id) and remove it from the front in a 'soft' way - that is, we check for its existence before we remove it and don't fail if it doesn't exist. With this, we should hit all of the cases discussed above: single-turn, multi-turn, base models whose chat template is plain concatenation, and tokenizers without a BOS token.
With this change, we should be good to merge.
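A minimal sketch of that 'soft' removal, assuming a Hugging Face tokenizer; the function name and the is_first_turn flag are illustrative, not from the PR:

```python
def tokenize_turn(tokenizer, messages: list[dict], is_first_turn: bool) -> list[int]:
    # Apply the tokenizer's default chat template for this turn.
    token_ids = tokenizer.apply_chat_template(messages, tokenize=True)
    bos_id = tokenizer.bos_token_id
    # 'Soft' BOS removal: strip a leading BOS only if the tokenizer defines
    # one and the template actually emitted it; don't fail if it is absent.
    # Base models with concatenation-only templates pass through unchanged,
    # as do tokenizers without a BOS token.
    if not is_first_turn and bos_id is not None and token_ids and token_ids[0] == bos_id:
        token_ids = token_ids[1:]
    return token_ids
```

Concatenating the per-turn ids then matches tokenizing the full conversation in one shot, which keeps single-turn behavior untouched.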
Done. @SahilJain314 would you take another look?

@terrykong / @SahilJain314 would you take a look at this? It has been weeks... If it is not something you are interested in, I will close the PR.
Hi @jialei777. Sorry for the delay on our side. I plan to take a look at your PR later this week to shepherd it in.

@jialei777 sorry for the delay. @ahmadki to help review and merge this.

Thank you for the PR @jialei777, and sorry for the delay from our side. Once done, I'll confirm the convergence graph you attached and we should be good to go.

Hey @ahmadki, thank you so much for picking this up. Your change looks good, please feel free to merge.

@ahmadki what is the status here? If still relevant, can we try to get this PR merged?
Closing in favor of #1412 - should be merged as soon as I confirm the convergence graph. @jialei777 Do you mind adding copyright headers to the new files?
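For reference, the new files would need a header along these lines, assuming the repo's standard NVIDIA Apache-2.0 header (the exact text and year should be copied from an existing file in the repo):

```python
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```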
Closing again in favor of #1412.
Recreating the old PR #606, since I messed up the DCO by merging main.
What does this PR do?
Add a simple example of multi-turn GRPO using ADK.
Usage
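A hypothetical invocation, assuming the example follows the repo's usual script-plus-YAML-config pattern; both paths below are placeholders, not the actual files added by this PR:

```bash
# Placeholder paths -- check the example directory added by this PR for the
# actual entry point and config.
uv run examples/run_grpo_multiturn_adk.py --config examples/configs/grpo_multiturn_adk.yaml
```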
Training reward: (screenshot attached to the original PR)

Validation accuracy: (screenshot attached to the original PR)