-
Notifications
You must be signed in to change notification settings - Fork 203
fix: fix nccl P2P initialization error for non-colocated #636
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Zhanda <[email protected]>
Signed-off-by: Zhanda <[email protected]>
Signed-off-by: Zhanda Zhu <[email protected]>
yuki-97
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks @Dazz993 !
|
Hmmm ... I was looking into the mypy errors in #632 until I realized that mypy failed for this PR too. It would be hard to imagine why this PR would trigger any mypy failures. @terrykong @parthchadha is mypy failing expected? |
|
@wangshangsam the mypy job is expected to fail. It won't block a PR, but just as an FYI of typing issues. Once we're completely in the green, we'll change that so it gates PRs |
Signed-off-by: Zhanda <[email protected]> Signed-off-by: Zhanda Zhu <[email protected]> Co-authored-by: Zhanda Zhu <[email protected]> Signed-off-by: Zhiyu Li <[email protected]>
…#636) Signed-off-by: Zhanda <[email protected]> Signed-off-by: Zhanda Zhu <[email protected]> Co-authored-by: Zhanda Zhu <[email protected]> Signed-off-by: Jialei Chen <[email protected]>
Signed-off-by: Zhanda <[email protected]> Signed-off-by: Zhanda Zhu <[email protected]> Co-authored-by: Zhanda Zhu <[email protected]>
…#636) Signed-off-by: Zhanda <[email protected]> Signed-off-by: Zhanda Zhu <[email protected]> Co-authored-by: Zhanda Zhu <[email protected]>
What does this PR do ?
Adds explicit
NCCL_CUMEM_ENABLE=1environment variable setting to resolve P2P initialization failures in distributed training with vLLM.Please see detailed analysis in #564 (comment).
Issues
Closes #564.
This PR can also be helpful to #613. Maybe @YUki-666 could take a look. The only change you may need to make it to delete the
os.environ["NCCL_CUMEM_ENABLE"] = "0"in functioninit_collectivefornemo_rl/models/policy/megatron_policy_worker.py.Test results
I have tested the settings @YUki-666 provided: 8b model grpo on 5 nodes:
Where
exp1_5n_non_colocated_p2p_disabled: before the fix, running withNCCL_P2P_DISABLE=1.exp2_5n_non_colocated_fix: after this pr's fix.We can see that:
Usage
The fix automatically applies when using distributed training with vLLM generation workers. No user action required.
Additional Information
This PR works for both the current
vllm==0.9.0and also new versions likevllm>=0.9.1rc1.If we upgrade our version, we can remove the additional environment variable setting in
nemo_rl/models/generation/vllm_backend.py.