nccl/rccl integration #469

Open
wants to merge 12 commits into
base: main

Conversation

seagater
Contributor

@seagater seagater commented Feb 25, 2025

Use dlopen to load the NCCL/RCCL APIs from a shared library, replacing the fallback code for AllGather, AllReduce, Broadcast, and ReduceScatter.

Add two related environment variables:
-x MSCCLPP_ENABLE_SHARED_LIB=TRUE -x MSCCLPP_NCCL_LIB_PATH=path_to_libnccl.so/librccl.so
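
For illustration, the gating described above could look roughly like the sketch below. It is not the code from this PR: the helper names `sharedLibEnabled` and `openNcclLibrary` are hypothetical, and the exact, case-sensitive match on `TRUE` is an assumption. Only `dlopen`, `dlerror`, and `std::getenv` are real library calls.

```cpp
#include <dlfcn.h>   // dlopen, dlerror (POSIX)
#include <cstdio>    // std::fprintf
#include <cstdlib>   // std::getenv
#include <string>

// Hypothetical helper: true only when MSCCLPP_ENABLE_SHARED_LIB is exactly "TRUE".
// (Assumption: the comparison is case-sensitive.)
bool sharedLibEnabled() {
  const char* value = std::getenv("MSCCLPP_ENABLE_SHARED_LIB");
  return value != nullptr && std::string(value) == "TRUE";
}

// Hypothetical helper: returns a dlopen handle for the library named by
// MSCCLPP_NCCL_LIB_PATH, or nullptr if the feature is disabled, the path is
// unset or empty, or the library cannot be loaded.
void* openNcclLibrary() {
  if (!sharedLibEnabled()) return nullptr;

  const char* path = std::getenv("MSCCLPP_NCCL_LIB_PATH");
  if (path == nullptr || *path == '\0') return nullptr;

  // RTLD_NOW resolves all symbols up front, so a broken library fails here
  // rather than at the first collective call.
  void* handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
  if (handle == nullptr) {
    std::fprintf(stderr, "dlopen(%s) failed: %s\n", path, dlerror());
  }
  return handle;
}
```

A caller would check the returned handle once at initialization and only take the shared-library path when it is non-null.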

@SreevatsaAnantharamu
Contributor

SreevatsaAnantharamu commented Feb 25, 2025

Additional symbols that need to be loaded from libnccl.so:

  • ncclGetUniqueId
  • ncclCommInitRank

nccl_ops_t->ncclGetUniqueId(

Here, before returning, you have to call nccl_ops_t->ncclCommInitRank to create a real NCCL communicator. MSCCL++'s ncclComm_t can hold it as a member, declared either as a void * or as an ncclComm_t nccl_comm.

nccl_ops_t->ncclCommInitRank(&commPtr->nccl_comm, ... )
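
A minimal sketch of that suggestion, under stated assumptions: the types below (`RealNcclComm`, `RealUniqueId`, `RealResult`, `NcclOpsSketch`, `MscclppCommSketch`) and the function `initCommSketch` are stand-ins invented for illustration so the example compiles without NCCL headers. Only the parameter order of `ncclCommInitRank` (communicator pointer, rank count, unique ID, rank) mirrors the real NCCL API.

```cpp
// Stand-ins for the real NCCL types (assumptions, for illustration only).
using RealNcclComm = void*;                       // plays the role of NCCL's ncclComm_t
struct RealUniqueId { char internal[128]; };      // plays the role of ncclUniqueId
enum RealResult { kSuccess = 0, kError = 1 };     // plays the role of ncclResult_t

// Hypothetical table of function pointers resolved from the shared library.
struct NcclOpsSketch {
  RealResult (*ncclCommInitRank)(RealNcclComm*, int, RealUniqueId, int) = nullptr;
};

// Hypothetical MSCCL++-side communicator that also carries the real one.
struct MscclppCommSketch {
  int rank = -1;
  int nRanks = 0;
  RealNcclComm nccl_comm = nullptr;  // real NCCL/RCCL communicator, if loaded
};

// Fills in the MSCCL++-side fields, then creates the real communicator
// through the loaded function pointer before returning.
RealResult initCommSketch(const NcclOpsSketch& ops, MscclppCommSketch* comm,
                          int nRanks, RealUniqueId id, int rank) {
  if (comm == nullptr) return kError;
  comm->rank = rank;
  comm->nRanks = nRanks;

  // No shared library loaded: leave nccl_comm null so callers can tell
  // that the real library is not available.
  if (ops.ncclCommInitRank == nullptr) return kSuccess;

  return ops.ncclCommInitRank(&comm->nccl_comm, nRanks, id, rank);
}
```

Storing the real communicator inside the wrapper means later collectives can forward `comm->nccl_comm` to the loaded library without a separate lookup.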

@seagater
Contributor Author

seagater commented Mar 5, 2025

Add two related environment variables:
-x MSCCLPP_ENABLE_SHARED_LIB=TRUE -x MSCCLPP_NCCL_LIB_PATH=path_to_libnccl.so/librccl.so

Support dlopen for the following NCCL APIs:
ncclCommInitRank
ncclGetUniqueId
ncclCommDestroy
ncclCommUserRank
ncclAllGather
ncclAllReduce
ncclBroadcast
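
As a rough sketch of how the symbols listed above might be resolved with `dlsym`, the example below fills an untyped table and reports any names that could not be found. The struct `NcclSymbolTable` and the function `loadNcclSymbols` are hypothetical names, not taken from this PR; a real implementation would cast each pointer to the matching NCCL function-pointer type.

```cpp
#include <dlfcn.h>   // dlsym, dlerror (POSIX)
#include <string>
#include <vector>

// Hypothetical untyped table, one slot per API listed above.
struct NcclSymbolTable {
  void* ncclCommInitRank = nullptr;
  void* ncclGetUniqueId = nullptr;
  void* ncclCommDestroy = nullptr;
  void* ncclCommUserRank = nullptr;
  void* ncclAllGather = nullptr;
  void* ncclAllReduce = nullptr;
  void* ncclBroadcast = nullptr;
};

// Resolves every listed symbol from `handle` (as returned by dlopen) and
// returns the names that were not found; an empty result means full success.
std::vector<std::string> loadNcclSymbols(void* handle, NcclSymbolTable& table) {
  struct Entry { const char* name; void** slot; };
  const Entry entries[] = {
      {"ncclCommInitRank", &table.ncclCommInitRank},
      {"ncclGetUniqueId", &table.ncclGetUniqueId},
      {"ncclCommDestroy", &table.ncclCommDestroy},
      {"ncclCommUserRank", &table.ncclCommUserRank},
      {"ncclAllGather", &table.ncclAllGather},
      {"ncclAllReduce", &table.ncclAllReduce},
      {"ncclBroadcast", &table.ncclBroadcast},
  };

  std::vector<std::string> missing;
  for (const Entry& entry : entries) {
    dlerror();  // clear any stale error state before the lookup
    *entry.slot = dlsym(handle, entry.name);
    if (*entry.slot == nullptr) missing.push_back(entry.name);
  }
  return missing;
}
```

Returning the missing names, rather than a bare boolean, makes it easier to log which symbol a given libnccl.so or librccl.so build lacks.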

The following tests pass:
nccl-test:
Allgather, Allreduce, Broadcast, ReduceScatter

rccl-test:
Allgather, Allreduce, Broadcast, ReduceScatter


2 participants