@zhandaz zhandaz commented Jul 10, 2025

What does this PR do?

Adds explicit NCCL_CUMEM_ENABLE=1 environment variable setting to resolve P2P initialization failures in distributed training with vLLM.

Please see detailed analysis in #564 (comment).
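
As a rough illustration of the change described above, the sketch below sets the variable before any NCCL communicator is created. The helper name `enable_nccl_cumem` and its `env` parameter are hypothetical and are not taken from this PR's diff; only the variable name and value come from the PR.

```python
import os
from typing import MutableMapping, Optional


def enable_nccl_cumem(
    env: Optional[MutableMapping[str, str]] = None,
) -> Optional[str]:
    """Set NCCL_CUMEM_ENABLE=1 and return the value it had before (or None).

    Hypothetical helper for illustration only. It is meant to be called
    before NCCL is initialized in the process, so that the setting is
    already in place when the communicator is created.
    """
    # Default to the real process environment; a plain dict can be passed
    # instead, which keeps the helper easy to test.
    target = os.environ if env is None else env
    previous = target.get("NCCL_CUMEM_ENABLE")
    target["NCCL_CUMEM_ENABLE"] = "1"
    return previous
```

Returning the previous value makes it easy to log or assert whether some earlier code path had already set the variable to something else.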

Issues

Closes #564.

This PR may also help with #613. Maybe @YUki-666 could take a look. The only change you may need to make is to delete the os.environ["NCCL_CUMEM_ENABLE"] = "0" line in the function init_collective in nemo_rl/models/policy/megatron_policy_worker.py.
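
The suggestion for #613 comes down to making sure nothing resets the variable to 0 before the collective is set up. A minimal, hypothetical guard that flags such an override could look like the sketch below; the function name is an assumption and this is not code from nemo_rl.

```python
import os
from typing import Mapping, Optional


def cumem_explicitly_disabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if NCCL_CUMEM_ENABLE is explicitly set to "0".

    Hypothetical check for illustration only; an unset variable or any
    value other than "0" returns False.
    """
    source = os.environ if env is None else env
    return source.get("NCCL_CUMEM_ENABLE") == "0"
```

A worker could call such a check right before initializing its collective and log a warning if it returns True.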

Test results

I tested the setup @YUki-666 provided (an 8B model trained with GRPO on 5 nodes):

[image: plot comparing the two runs described below]

Where:

  • Purple line: exp1_5n_non_colocated_p2p_disabled, the run before the fix, with NCCL_P2P_DISABLE=1.
  • Pink line: exp2_5n_non_colocated_fix, the run after this PR's fix.

We can see that:

  • The training reward of the two runs mostly aligns.
  • With the fix, the training speed is back to normal.

Usage

The fix automatically applies when using distributed training with vLLM generation workers. No user action required.

Additional Information

This PR works with both the current vllm==0.9.0 and newer versions such as vllm>=0.9.1rc1.
Once we upgrade our vLLM version, we can remove the additional environment variable setting in nemo_rl/models/generation/vllm_backend.py.

Signed-off-by: Zhanda Zhu <[email protected]>
@zhandaz zhandaz requested a review from yuki-97 July 10, 2025 05:57

@yuki-97 yuki-97 left a comment

LGTM! Thanks @Dazz993!

@wangshangsam

Hmmm ... I was looking into the mypy errors in #632 until I realized that mypy failed for this PR too. It would be hard to imagine why this PR would trigger any mypy failures.

@terrykong @parthchadha is mypy failing expected?

@terrykong

@wangshangsam the mypy job is expected to fail. It won't block a PR; it's just an FYI about typing issues. Once we're completely in the green, we'll change that so it gates PRs.

@parthchadha parthchadha added this pull request to the merge queue Jul 10, 2025
Merged via the queue into main with commit 233cfca Jul 11, 2025
13 of 14 checks passed
@parthchadha parthchadha deleted the zhanda/fix-nccl branch July 11, 2025 01:55
guyueh1 added a commit that referenced this pull request Jul 15, 2025
ZhiyuLi-Nvidia pushed a commit that referenced this pull request Jul 21, 2025
Signed-off-by: Zhanda <[email protected]>
Signed-off-by: Zhanda Zhu <[email protected]>
Co-authored-by: Zhanda Zhu <[email protected]>
Signed-off-by: Zhiyu Li <[email protected]>
jialei777 pushed a commit to jialei777/nemo-rl that referenced this pull request Jul 23, 2025
…#636)

Signed-off-by: Zhanda <[email protected]>
Signed-off-by: Zhanda Zhu <[email protected]>
Co-authored-by: Zhanda Zhu <[email protected]>
Signed-off-by: Jialei Chen <[email protected]>
KiddoZhu pushed a commit that referenced this pull request Jul 28, 2025
Signed-off-by: Zhanda <[email protected]>
Signed-off-by: Zhanda Zhu <[email protected]>
Co-authored-by: Zhanda Zhu <[email protected]>
FannYYW pushed a commit to xxman-google/NeMo-RL that referenced this pull request Aug 5, 2025
Development

Successfully merging this pull request may close these issues.

NCCL error when using non-colocated generation and set_model_state_dict apis
