Skip to content

Use NCCL wheels from PyPI for CUDA 12 builds#2629

Merged
rapids-bot[bot] merged 27 commits intorapidsai:branch-25.06from
divyegala:nccl-wheel
Apr 16, 2025
Merged

Use NCCL wheels from PyPI for CUDA 12 builds#2629
rapids-bot[bot] merged 27 commits intorapidsai:branch-25.06from
divyegala:nccl-wheel

Conversation

@divyegala
Copy link
Member

This PR attempts to de-vendor NCCL DSOs that we bundle along with our wheel builds.

@divyegala divyegala added feature request New feature or request non-breaking Non-breaking change labels Apr 8, 2025
@divyegala divyegala self-assigned this Apr 8, 2025
@divyegala divyegala requested review from a team as code owners April 8, 2025 20:55
@divyegala divyegala requested a review from gforsyth April 8, 2025 20:55
@github-actions github-actions bot removed the python label Apr 8, 2025
@divyegala divyegala requested a review from a team as a code owner April 8, 2025 22:55
@github-actions github-actions bot added the cpp label Apr 9, 2025
@divyegala divyegala requested a review from a team as a code owner April 9, 2025 21:33
@github-actions github-actions bot removed the cpp label Apr 16, 2025
Comment on lines 81 to 82
- nccl ${{ nccl_version }}
- ucxx ${{ ucxx_version }}
Copy link
Member Author

@divyegala divyegala Apr 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

libraft recipe installs the distributed component

cmake --install cpp/build --component distributed
which has a build and runtime dependency on targets from NCCL and ucxx as seen here

raft/cpp/CMakeLists.txt

Lines 392 to 406 in 902ed56

rapids_export_package(
BUILD ucxx raft-distributed-exports COMPONENTS ucxx python GLOBAL_TARGETS ucxx::ucxx ucxx::python
)
rapids_export_package(
INSTALL ucxx raft-distributed-exports COMPONENTS ucxx python GLOBAL_TARGETS ucxx::ucxx
ucxx::python
)
rapids_export_package(BUILD NCCL raft-distributed-exports)
rapids_export_package(INSTALL NCCL raft-distributed-exports)
# ucx is a requirement for raft_distributed, but its config is not safe to be found multiple times,
# so rather than exporting a package dependency on it above we rely on consumers to find it
# themselves. Once https://github.com/rapidsai/ucxx/issues/173 is resolved we can export it above
# again.
target_link_libraries(raft_distributed INTERFACE ucx::ucp ucxx::ucxx NCCL::NCCL)

I think these changes are needed and I believe the reason it did not error out until now is because distributed is a header-only target with INTERFACE dependencies so if consumer libraries brought in NCCL themselves it would be okay.

Happy to revert the changes, however, if this is considered an improper way of working with header only packages.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@divyegala I think these changes to the rattler-build recipe are reasonable, but can we break them out into a separate PR so we've got a cleaner history of when and why this was added?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this seems fine but doesn't make sense to bundle with the NCCL wheels work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted the commit, opened new PR #2636

@divyegala divyegala removed the request for review from a team April 16, 2025 18:24
fi

RAPIDS_CUDA_MAJOR="${RAPIDS_CUDA_VERSION%%.*}"
if [[ ${RAPIDS_CUDA_MAJOR} == "12" ]]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we do this only for CUDA 12?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there's no NCCL wheel distributions available for CUDA 11 since April 2024 and even then, no arm64 binaries. I think we'll have to continue to vendor NCCL binaries with CUDA 11.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great. That is fine! I just didn't know the constraints off the top of my head.

common:
- output_types: conda
packages:
- &nccl_unsuffixed nccl>=2.19
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 2.19 a high enough lower bound? The latest is 2.26. I'm pretty sure we rely on newer versions than 2.19 for Blackwell support and other features. I don't know the exact bound to use here but we should do some validation with the oldest NCCL release we claim to support.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. I think we could bump the version of NCCL as a follow-on to this PR depending on the . I just used the version constraint that was already present. I just checked our ci-wheel images and the system version on those is also 2.26.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alrighty! If this matches other bounds already being used in RAFT, we can merge in the current state.

Copy link
Contributor

@gforsyth gforsyth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@divyegala
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 022a4f1 into rapidsai:branch-25.06 Apr 16, 2025
80 checks passed
rapids-bot bot pushed a commit to rapidsai/cuvs that referenced this pull request Apr 22, 2025
Reference PR in RAFT rapidsai/raft#2629.

The size of `libcuvs-cu12` wheel goes down from about 1.2G to 854M.

Authors:
  - Divye Gala (https://github.com/divyegala)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #827
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci CMake feature request New feature or request non-breaking Non-breaking change python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants