Conversation
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia commented Aug 28, 2025

What does this PR do ?

Closes #998.

Looks like the issue can be resolved with the compilation flag `{"use_inductor": False}`. With this flag, vLLM uses its custom CUDA kernels instead of the Triton kernels generated by torch.compile, which appear to be the source of the numerical issue here.

There are no logprob error spikes over 140 steps, and rewards increase stably. Speed looks similar with and without the flag.
https://wandb.ai/nvidia/grpo-dev-zhiyul/workspace?nw=nwuserzhiyul

  • Rewards over 140 steps (screenshot in PR description)
  • Logprob error with and without the change (screenshot in PR description)


Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
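A minimal sketch of what the fix looks like when constructing a vLLM engine directly. Assumptions: a recent vLLM version whose `LLM()` constructor accepts a `compilation_config` dict, and a placeholder model name; instantiation is commented out because it requires a GPU.

```python
# Sketch: disable Inductor so vLLM falls back to its custom CUDA kernels
# instead of the Triton kernels generated by torch.compile.
# Assumptions: recent vLLM where LLM() accepts a compilation_config dict;
# the model name below is a placeholder.

engine_kwargs = {
    "model": "Qwen/Qwen2.5-1.5B",  # placeholder model
    "compilation_config": {"use_inductor": False},
}

# Instantiation needs a GPU, so it is left commented out here:
# from vllm import LLM
# llm = LLM(**engine_kwargs)

print(engine_kwargs["compilation_config"]["use_inductor"])
```

The same flag can be threaded through whatever layer of your stack forwards engine arguments to vLLM; the key point is that `use_inductor` lives inside `compilation_config`, not at the top level.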

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@parthchadha
Contributor

@ZhiyuLi-Nvidia good find! Can you share performance on larger qwen models as well? Also, please attach the plots to the PR description since not everyone can access internal wandb reports.

@terrykong terrykong left a comment

nice find @ZhiyuLi-Nvidia !

is it possible to construct a model diagnostic test for this?

https://github.com/NVIDIA-NeMo/RL/tree/main/tools/model_diagnostics

might be helpful for others who are debugging their model run

@ZhiyuLi-Nvidia
Contributor Author

Thank you @parthchadha

@ZhiyuLi-Nvidia good find! Can you share performance on larger qwen models as well?

Which model do you recommend?

Also, please attach the plots to the PR description since not everyone can access internal wandb reports.

Added the key screenshots.

@ZhiyuLi-Nvidia force-pushed the zhiyul/deepscaler_recipe_convergence_fix branch from 6883f11 to b3aae4f on August 28, 2025 17:46
@parthchadha
Contributor

Thank you @parthchadha

@ZhiyuLi-Nvidia good find! Can you share performance on larger qwen models as well?

Which model do you recommend?

Also, please attach the plots to the PR description since not everyone can access internal wandb reports.

Added the key screenshots.

Let's run qwen 32b from #957 (we can try with 32k osl)

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Aug 29, 2025
@ZhiyuLi-Nvidia
Contributor Author

is it possible to construct a model diagnostic test for this?

https://github.com/NVIDIA-NeMo/RL/tree/main/tools/model_diagnostics

might be helpful for others who are debugging their model run

Good suggestion. Added.

@ZhiyuLi-Nvidia force-pushed the zhiyul/deepscaler_recipe_convergence_fix branch from 55191aa to f5bf231 on August 29, 2025 18:46
@ZhiyuLi-Nvidia
Contributor Author

@terrykong added output example 2ba5e3e

Signed-off-by: Zhiyu Li <[email protected]>
@ZhiyuLi-Nvidia
Contributor Author

Thank you @parthchadha

@ZhiyuLi-Nvidia good find! Can you share performance on larger qwen models as well?

Which model do you recommend?

Also, please attach the plots to the PR description since not everyone can access internal wandb reports.

Added the key screenshots.

Let's run qwen 32b from #957 (we can try with 32k osl)

@parthchadha I kept getting OOMs in the middle of training. Shall we revisit this once the PR is merged or the run is in a more stable state?

parthchadha
parthchadha previously approved these changes Sep 2, 2025
@terrykong terrykong added this pull request to the merge queue Sep 2, 2025
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia removed this pull request from the merge queue due to a manual request Sep 2, 2025
terrykong
terrykong previously approved these changes Sep 2, 2025
@terrykong terrykong added this pull request to the merge queue Sep 2, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 3, 2025
@ZhiyuLi-Nvidia force-pushed the zhiyul/deepscaler_recipe_convergence_fix branch from 1bcb7ae to dddcbf0 on September 3, 2025 17:41
@terrykong terrykong changed the title fix: fix convergence issue by adding use_inductor=False in vllm compi… fix: fix convergence issue by adding use_inductor=False in vllm compilation_config Sep 3, 2025
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia changed the title fix: fix convergence issue by adding use_inductor=False in vllm compilation_config fix: convergence issue by adding use_inductor=False in vllm compilation_config Sep 3, 2025
@terrykong terrykong enabled auto-merge September 3, 2025 19:17
terrykong
terrykong previously approved these changes Sep 3, 2025
@terrykong terrykong added this pull request to the merge queue Sep 3, 2025
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia removed this pull request from the merge queue due to a manual request Sep 3, 2025
parthchadha
parthchadha previously approved these changes Sep 3, 2025
@terrykong terrykong added this pull request to the merge queue Sep 4, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Sep 4, 2025
terrykong
terrykong previously approved these changes Sep 4, 2025
@terrykong terrykong added this pull request to the merge queue Sep 4, 2025
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia removed this pull request from the merge queue due to a manual request Sep 5, 2025
@terrykong terrykong added this pull request to the merge queue Sep 8, 2025
Merged via the queue into main with commit 1c85276 Sep 9, 2025
21 checks passed
@terrykong terrykong deleted the zhiyul/deepscaler_recipe_convergence_fix branch September 9, 2025 00:51
guyueh1 pushed a commit to guyueh1/NeMo-RL that referenced this pull request Sep 15, 2025
HeyyyyyyG pushed a commit that referenced this pull request Oct 3, 2025
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025

Labels

documentation Improvements or additions to documentation


Development

Successfully merging this pull request may close these issues.

GRPO Convergence Issue with vllm cuda graph enabled

4 participants