fix: convergence issue by adding use_inductor=False in vllm compilation_config #1014
Conversation
@ZhiyuLi-Nvidia good find! Can you share performance on larger qwen models as well? Also, please attach the plots to the PR description since not everyone can access internal wandb reports.
terrykong left a comment
nice find @ZhiyuLi-Nvidia !
is it possible to construct a model diagnostic test for this?
https://github.com/NVIDIA-NeMo/RL/tree/main/tools/model_diagnostics
might be helpful for others who are debugging their model run
Thank you @parthchadha
Which model do you recommend?
Added the key screenshots.
Let's run qwen 32b from #957 (we can try with 32k osl)
Good suggestion. Added.
@terrykong added output example 2ba5e3e
@parthchadha I kept getting OOM in the middle of training. Shall we come back to it once this is merged or in a more stable state?
What does this PR do?
Closes #998.
It looks like the issue can be resolved with the compilation flag `{"use_inductor": False}`. With this flag, vLLM uses its custom CUDA kernels instead of the Triton kernels generated by torch.compile, which appear to cause the numerical issue here.
There are no logprob error spikes over 140 steps and rewards increase stably. Speed looks similar.
https://wandb.ai/nvidia/grpo-dev-zhiyul/workspace?nw=nwuserzhiyul
Issues
Closes #998.
Usage
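A minimal sketch of what the flag does at the vLLM level. The model name and the direct `LLM(...)` construction below are illustrative assumptions, not how NeMo RL wires the setting through its own config; the point is only to show where `use_inductor=False` lands in vLLM's `compilation_config`.

```python
# Illustrative sketch only: shows the vLLM-level effect of this PR's flag.
# The model name is an arbitrary example (assumption); NeMo RL passes this
# through its own generation config rather than constructing LLM by hand.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",            # example model (assumption)
    compilation_config={"use_inductor": False},  # use vLLM's custom CUDA kernels
                                                 # instead of torch.compile's
                                                 # Triton-generated kernels
)
```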
Before your PR is "Ready for review"
Pre checks:
Additional Information