Skip to content

libwholegraph wheels: use nvidia-nccl wheels instead of vendoring libnccl.so#284

Merged
jameslamb merged 2 commits intorapidsai:branch-25.10from
jameslamb:devendor-libnccl
Aug 26, 2025
Merged

libwholegraph wheels: use nvidia-nccl wheels instead of vendoring libnccl.so#284
jameslamb merged 2 commits intorapidsai:branch-25.10from
jameslamb:devendor-libnccl

Conversation

@jameslamb
Copy link
Member

@jameslamb jameslamb commented Aug 25, 2025

Fixes #281

Similar to rapidsai/cuvs#827, proposes that wheels get their copy of libnccl.so at runtime from nvidia-nccl-cu{12,13} wheels, instead of vendoring a copying inside libwholegraph.

Notes for Reviewers

Benefits of these changes

  • I think this fill fix the nightly tests
    • @alexbarghi-nv investigated offline and together with some other people found that "competing libnccl between this project and pytorch" may be to blame for the failing nightly tests
  • smaller libwholegraph wheels
    • 120 MiB smaller!
    • now they can go on pypi.org without needing an exception 🎉 🎉 🎉
  • smaller disk footprint for environments
    • via avoiding multiple copies of libnccl.so
  • reduced risks of runtime bugs caused by multiple competing libnccl.so being loaded

@copy-pr-bot
Copy link

copy-pr-bot bot commented Aug 25, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@jameslamb
Copy link
Member Author

/ok to test

@jameslamb jameslamb added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Aug 25, 2025
@jameslamb jameslamb changed the title WIP: libwholegraph wheels: use nvidia-nccl wheels instead of vendoring libnccl.so libwholegraph wheels: use nvidia-nccl wheels instead of vendoring libnccl.so Aug 25, 2025
@jameslamb
Copy link
Member Author

I think this is ready for review!

cugraph-pyg wheel tests are failing like this:

/pyenv/versions/3.10.18/lib/python3.10/site-packages/pylibcudf/__init__.py:13: in <module>
    from . import (
E   ImportError: libcudf.so: cannot open shared object file: No such file or director

(build link)

But probably because of the ongoing libcudf.so-loading issues that I think should be fixed by rapidsai/cudf#19786

echo "libwholegraph-${RAPIDS_PY_CUDA_SUFFIX} @ file://$(echo "${LIBWHOLEGRAPH_WHEELHOUSE}"/libwholegraph_*.whl)" >> "${PIP_CONSTRAINT}"

export SKBUILD_CMAKE_ARGS="-DBUILD_SHARED_LIBS=ON;-DCMAKE_MESSAGE_LOG_LEVEL=VERBOSE;-DCUDA_STATIC_RUNTIME=ON;-DWHOLEGRAPH_BUILD_WHEELS=ON"
export SKBUILD_CMAKE_ARGS="-DBUILD_SHARED_LIBS=ON;-DCMAKE_MESSAGE_LOG_LEVEL=VERBOSE;-DCUDA_STATIC_RUNTIME=ON"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not strictly related, but noticed while I was looking at logs to confirm the NCCL stuff was working correctly.

  CMake Warning:
    Manually-specified variables were not used by the project:

      CUDA_STATIC_RUNTIME
      WHOLEGRAPH_BUILD_WHEELS

(build link)

Think we still want CUDA_STATIC_RUNTIME because it could get passed through here:

add_subdirectory(pylibwholegraph/binding)

But WHOLEGRAPH_BUILD_WHEELS doesn't do anything anywhere in this project.

git grep '_BUILD_WHEEL'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@jameslamb
Copy link
Member Author

/ok to test

@jameslamb jameslamb marked this pull request as ready for review August 25, 2025 20:37
@jameslamb jameslamb requested review from a team as code owners August 25, 2025 20:37
@jameslamb
Copy link
Member Author

This will need an admin merge because of the failing check-nightly-ci test...I can do that once this has approvals.

Copy link
Member

@alexbarghi-nv alexbarghi-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alexbarghi-nv
Copy link
Member

@jameslamb you have my permission to admin merge once we have packaging/CI approval

@jameslamb
Copy link
Member Author

Thank you!

Copy link
Contributor

@gforsyth gforsyth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice improvement, let's gooooo!


# PyPI limit is 100 MiB, fail CI before we get too close to that
max_allowed_size_compressed = '75M'
max_allowed_size_compressed = '10Mi'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooooh yeaahhh

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥳

Comment on lines +62 to +63
# PyPI limit is 100 MiB, fail CI before we get too close to that
max_allowed_size_compressed = '80Mi'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even better!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤘🏻

@jameslamb
Copy link
Member Author

Thanks @gforsyth ! Ok I'm going to admin-merge this and then manually trigger the nightly tests.

@jameslamb jameslamb merged commit be71c89 into rapidsai:branch-25.10 Aug 26, 2025
48 of 50 checks passed
@jameslamb jameslamb deleted the devendor-libnccl branch August 26, 2025 20:25
@jameslamb
Copy link
Member Author

All of the wheel tests passed on the manually-triggered nightly run!

https://github.com/rapidsai/cugraph-gnn/actions/runs/17250350223

Just one conda test failed, and it looks to me like something temporary while downloading a large dataset. I just restarted it... I'm hopeful it'll pass and that nightlies will be working here again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

improvement Improves an existing functionality non-breaking Introduces a non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

libwholegraph: stop vendoring NCCL in wheels

4 participants

Comments