Use CUDA wheels to avoid statically linking CUDA components in our wheels #35

Closed
1 task
vyasr opened this issue Apr 6, 2024 · 4 comments

vyasr commented Apr 6, 2024

To achieve manylinux compliance, RAPIDS wheels currently statically link all components of the CUDA Toolkit (CTK) that they consume. This leads to heavily bloated binaries, especially when the effect is compounded across many packages. NVIDIA now publishes wheels containing the CUDA libraries, and these libraries have been stress-tested by the wheels for various deep learning frameworks (e.g. PyTorch now depends on the CUDA wheels), so RAPIDS should do the same to reduce our wheel sizes.

This work is a companion to #33 and should probably be tackled afterwards, since #33 will reduce the scope of these changes to just the resulting C++ wheels. That is a meaningful reduction because multiple RAPIDS repos produce multiple wheels.

While the goals of this are aligned with #33 and the approach is similar, there are some notable differences because of the way the CUDA wheels are structured. In particular, they are not really designed to be compiled against, only run against. They do generally seem to contain both includes and libraries, which is helpful, but they do not contain any CMake or other packaging metadata, nor do they contain the multiple symlinked copies of libraries (e.g. linker name->soname->library name). The latter is a fundamental limitation of wheels not supporting symlinks, but it could cause issues for library discovery using standardized solutions like CMake's FindCUDAToolkit or pkg-config, which rely on a specific version of those files existing (AFAICT only the SONAME is present).

We should stage work on this in a way that minimizes conflicts with #31 and #33, both of which should facilitate this change. I propose the following, but all of it is open for discussion:

  1. Test dynamically linking in a build, then manually installing the CUDA wheels at runtime - Our first attempt should simply be to verify that we are able to interchange the libraries as expected. To achieve this, we will want to do the following:
    1. Pick a repo. raft is probably the best choice here since it's the main entry point for a lot of math libraries in RAPIDS (cuml, cugraph, and cuopt all use it that way) and because it only depends on rmm as a header-only library so there's minimal conflict with the ongoing work to introduce wheel interdependencies.
    2. Turn off static linking in the build, then configure auditwheel to exclude the relevant CUDA math libraries from inclusion. This can be done with the --exclude flag. The resulting wheel should be inspected to verify that all CUDA math libraries have been excluded from the build. Note that (at least for now) we want to continue statically linking the CUDA runtime. This change will likely require some CMake work to decouple static linking of cudart from the static linking of other CUDA libraries.
    3. We will then want to try installing these wheels into a new environment without the necessary CUDA libraries installed. This could be done using a container with a different CUDA version, or on a machine with CUDA drivers installed but relying on e.g. conda for installing the CUDA runtime and libraries. Attempting to import the wheel should give a linker error.
    4. We will then want to try installing the CUDA wheels and verify that we can make things work. The easiest choice at this stage will probably be to just set LD_LIBRARY_PATH.
  2. Build against the CUDA wheels - In the long run, we would like to be able to build against the CUDA wheels to ensure that we see a consistent build and runtime layout of CUDA files. At present, this is likely to be challenging due to some of the layout issues mentioned above. Concretely, I think that we will achieve the most benefit here if we attempt to make things work with the current layout, but do so in a way that makes it manifestly clear why the current layout is difficult to work with. We can then have a more productive discussion with the CUDA wheels teams about changes that we'd like to see (I have already started some of those discussions, but I think it'll be a lot easier to make headway when we have something concrete to discuss). With that in mind, I would suggest that at this stage we focus on writing custom CMake find modules for the CUDA libraries that work when we're building wheels. This will allow us to determine what shortcomings there are with the existing layouts.
  3. Layer on top of the C++ wheels - In the long run, RAPIDS Python packages should never need to deal directly with the CUDA wheels. Instead, they should be getting all their CUDA dependencies transitively via the C++ wheels. To achieve this goal, once we've reached this point with these wheels we should rework the above changes on top of the ongoing work to create separate C++ wheels.
  4. Figure out a suitable CUDA library loading strategy - The easiest way to make our wheels work with CUDA wheels at runtime is by setting the RPATHs to do a relative lookup of the libraries in the CUDA packages. Ideally I think we would want to push for the CUDA packages to instead manage the libraries via dynamic loading (the way I've set up the RAPIDS C++ wheels) to insulate consumers from the file layout of the wheel, the use of layered environments, etc, but that's probably not going to be an option in the near to intermediate term. Therefore, our options will likely be to set the RPATHs of our binaries directly, or to load the libraries in Python ourselves. The latter is a bit more flexible in that it would allow the potential for coexistence with system-installed CUDA libraries if desired, so for the purpose of e.g. DLFW containers we may still want to go that route. This would be the stage where we try to figure out in general the degree to which we want to support system vs pip-installed CUDA libraries when using our pip wheels.
  5. Roll out to all the libraries - Once we reach this point, we can make analogous changes to other RAPIDS packages.
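
As a rough illustration of the "load the libraries in Python ourselves" option from step 4, here is a minimal sketch. The function names `find_wheel_library` and `load_cuda_library` are hypothetical and not part of any RAPIDS or NVIDIA package, and the `nvidia/<project>/lib` layout is an assumption about how the CUDA wheels install their files.

```python
import ctypes
import os
import sysconfig


def find_wheel_library(project, soname, site_dir=None):
    """Return the path to `soname` under nvidia/<project>/lib, or None if absent.

    The nvidia/<project>/lib layout is an assumption about the CUDA wheels.
    """
    site_dir = site_dir or sysconfig.get_path("platlib")
    candidate = os.path.join(site_dir, "nvidia", project, "lib", soname)
    return candidate if os.path.isfile(candidate) else None


def load_cuda_library(project, soname, site_dir=None):
    """Prefer the pip-installed copy; otherwise defer to the system loader.

    Passing the bare SONAME lets the dynamic loader search its usual
    locations (LD_LIBRARY_PATH, the ld.so cache, default directories).
    """
    path = find_wheel_library(project, soname, site_dir)
    return ctypes.CDLL(path or soname)
```

A package could call something like `load_cuda_library("cublas", "libcublas.so.12")` before importing its compiled extension modules. On glibc-based Linux, a library that is already loaded is generally reused when a later-loaded module lists the same SONAME as a dependency, which is what would make this approach coexist with system-installed CUDA libraries.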

DLFW / devcontainers adjustments

@jameslamb
Member

Putting up some of my notes from poking around at this today.

On an x86_64 machine, ran the following.

docker run \
    --rm \
    -it rapidsai/ci-wheel:cuda12.2.2-rockylinux8-py3.10 \
    bash

Looked for the default python interpreter's location for storing architecture-specific libraries:

python -c "import sysconfig; print(sysconfig.get_path('platlib'))"
# /pyenv/versions/3.10.14/lib/python3.10/site-packages

Installed some stuff.

python -m pip install \
    --extra-index-url https://pypi.nvidia.com \
        nvidia-cusparse-cu12 \
        nvidia-cublas-cu12 \
        nvidia-cufft-cu12 \
        'pylibraft-cu12==24.4.*'

Ok, so where did it put all those CUDA libraries?

find \
    /pyenv/versions/3.10.14/lib/python3.10/site-packages \
    -type f \
    -name 'libcu*.so*'

# /pyenv/versions/3.10.14/lib/python3.10/site-packages/nvidia/cublas/lib/libcublasLt.so.12
# /pyenv/versions/3.10.14/lib/python3.10/site-packages/nvidia/cublas/lib/libcublas.so.12
# /pyenv/versions/3.10.14/lib/python3.10/site-packages/nvidia/cufft/lib/libcufft.so.11
# /pyenv/versions/3.10.14/lib/python3.10/site-packages/nvidia/cufft/lib/libcufftw.so.11
# /pyenv/versions/3.10.14/lib/python3.10/site-packages/nvidia/cusparse/lib/libcusparse.so.12
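
Every file in that listing carries a version suffix, which matches the observation at the top of the issue that only the SONAME is shipped. A small sketch of how one might check that programmatically is below. The helper `missing_linker_names` is hypothetical, and the regular expression assumes the usual `lib<name>.so.<version>` naming.

```python
import re

# Matches versioned shared-library file names such as "libcublas.so.12".
_VERSIONED = re.compile(r"^(lib.+\.so)\.[0-9.]+$")


def missing_linker_names(file_names):
    """Return unversioned linker names (e.g. "libcublas.so") that are absent
    even though a versioned file (e.g. "libcublas.so.12") is present."""
    names = set(file_names)
    missing = set()
    for name in names:
        match = _VERSIONED.match(name)
        if match and match.group(1) not in names:
            missing.add(match.group(1))
    return sorted(missing)


# File names taken from the listing above.
installed = [
    "libcublasLt.so.12",
    "libcublas.so.12",
    "libcufft.so.11",
    "libcufftw.so.11",
    "libcusparse.so.12",
]
print(missing_linker_names(installed))
```

Any name reported here is one that a plain `-l<name>` link line or a find module searching for `lib<name>.so` would fail to locate in the wheel's `lib` directory.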

And what about libraft.so?

find \
    /pyenv/versions/3.10.14/lib/python3.10/site-packages \
    -type f \
    -name 'libraft*.so*'

# /pyenv/versions/3.10.14/lib/python3.10/site-packages/pylibraft/libraft.so

So by default, it looks like those CUDA libraries will be at this path relative to libraft.so:

$ORIGIN/../nvidia/{project}/lib/

Where {project} is the top-level name with no lib prefix, e.g. "cublas", "cufft", etc.
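
To sanity-check that relative path, here is a minimal sketch that derives `$ORIGIN`-based entries from the absolute locations found above. The helper name `rpath_entry` is hypothetical, and the paths are the ones from this particular container, so they are not guaranteed for other installation layouts.

```python
import os


def rpath_entry(consumer_lib, dependency_dir):
    """Express dependency_dir relative to consumer_lib's directory as an $ORIGIN entry."""
    rel = os.path.relpath(dependency_dir, start=os.path.dirname(consumer_lib))
    return "$ORIGIN" if rel == "." else f"$ORIGIN/{rel}"


# Paths observed in the container above; other environments may differ.
site = "/pyenv/versions/3.10.14/lib/python3.10/site-packages"
libraft = f"{site}/pylibraft/libraft.so"
entries = [
    rpath_entry(libraft, f"{site}/nvidia/{project}/lib")
    for project in ("cublas", "cufft", "cusparse")
]
# Colon-separated, the same form readelf shows for a RUNPATH value.
print(":".join(entries))
# $ORIGIN/../nvidia/cublas/lib:$ORIGIN/../nvidia/cufft/lib:$ORIGIN/../nvidia/cusparse/lib
```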

I saw at least one example where two libraries that depend on each other are installed together, and one has RUNPATH pointing to the other.

SITE="/pyenv/versions/3.10.14/lib/python3.10/site-packages"

ldd ${SITE}/nvidia/cublas/lib/libcublas.so.12
# libcublasLt.so.12 => /pyenv/versions/3.10.14/lib/python3.10/site-packages/nvidia/cublas/lib/libcublasLt.so.12 (0x00007f1c23f31000)

readelf -d ${SITE}/nvidia/cublas/lib/libcublas.so.12 | grep PATH
#  0x000000000000001d (RUNPATH)            Library runpath: [$ORIGIN]
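
For checking many libraries at once, the same `readelf -d` output can be parsed in Python. This is only a sketch: `parse_search_paths` is a hypothetical helper, and the pattern assumes the GNU binutils output format shown above.

```python
import re

# Matches lines such as:
#   0x000000000000001d (RUNPATH)   Library runpath: [$ORIGIN]
_DYN_PATH = re.compile(r"\((RPATH|RUNPATH)\)\s+Library (?:rpath|runpath): \[(.*)\]")


def parse_search_paths(readelf_output):
    """Map "RPATH"/"RUNPATH" to the list of directories in `readelf -d` output."""
    found = {}
    for line in readelf_output.splitlines():
        match = _DYN_PATH.search(line)
        if match:
            found[match.group(1)] = match.group(2).split(":")
    return found


sample = " 0x000000000000001d (RUNPATH)            Library runpath: [$ORIGIN]"
print(parse_search_paths(sample))
# {'RUNPATH': ['$ORIGIN']}
```

The input could come from something like `subprocess.run(["readelf", "-d", path], capture_output=True, text=True).stdout`, assuming `readelf` is available in the environment.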

Tomorrow, I'll try a wheel build of raft customized as described at the top of this issue, and see what I learn.


Dumping some other links I've been consulting to get a better understanding of the difference between what happens to be true in the places I'm testing and what we can assume to be true about installation layouts generally.

rapids-bot bot pushed a commit to rapidsai/wholegraph that referenced this issue Jul 8, 2024
Usage of the CUDA math libraries is independent of the CUDA runtime. Make their static/shared status separately controllable.

Contributes to rapidsai/build-planning#35

Authors:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #190
rapids-bot bot pushed a commit to rapidsai/raft that referenced this issue Jul 8, 2024
Usage of the CUDA math libraries is independent of the CUDA runtime. Make their static/shared status separately controllable.

Contributes to rapidsai/build-planning#35

Authors:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Robert Maynard (https://github.com/robertmaynard)

URL: #2376
rapids-bot bot pushed a commit to rapidsai/cuml that referenced this issue Jul 9, 2024
Usage of the CUDA math libraries is independent of the CUDA runtime. Make their static/shared status separately controllable.

Contributes to rapidsai/build-planning#35

Authors:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)

URL: #5959
rapids-bot bot pushed a commit to rapidsai/cugraph that referenced this issue Jul 9, 2024
Usage of the CUDA math libraries is independent of the CUDA runtime. Make their static/shared status separately controllable.

Contributes to rapidsai/build-planning#35

Authors:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #4526
rapids-bot bot pushed a commit to rapidsai/cuvs that referenced this issue Jul 9, 2024
Usage of the CUDA math libraries is independent of the CUDA runtime. Make their static/shared status separately controllable.

Contributes to rapidsai/build-planning#35

Authors:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Ben Frederickson (https://github.com/benfred)

URL: #216
rapids-bot bot pushed a commit to rapidsai/wholegraph that referenced this issue Jul 15, 2024
#190 was supposed to separate static CUDA math libraries from static CUDA runtime library, but accidentally pulled the runtime along with the math libraries. The way we'd normally fix this is by creating a separate variable for the runtime. However, since this project doesn't actually use any math libraries, we can just revert the whole thing.

Contributes to rapidsai/build-planning#35

Authors:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #192
jameslamb assigned KyleFromNVIDIA and unassigned vyasr and jameslamb on Jul 24, 2024
rapids-bot bot pushed a commit to rapidsai/cuml that referenced this issue Aug 15, 2024
Use CUDA math wheels to reduce wheel size by not statically linking CUDA math libraries.

Contributes to rapidsai/build-planning#35

Authors:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #5966
KyleFromNVIDIA added a commit to rapidsai/devcontainers that referenced this issue Aug 21, 2024
With packages depending on CUDA wheels by default, we want to disable
that behavior in devcontainers. Add a `use_cuda_wheels=false` matrix
entry.

Contributes to rapidsai/build-planning#35
rapids-bot bot pushed a commit to rapidsai/cuvs that referenced this issue Aug 22, 2024
Use CUDA math wheels to reduce wheel size by not statically linking CUDA math libraries.

Contributes to rapidsai/build-planning#35

Authors:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

Approvers:
  - James Lamb (https://github.com/jameslamb)
  - Robert Maynard (https://github.com/robertmaynard)
  - Bradley Dice (https://github.com/bdice)

URL: #298
rapids-bot bot pushed a commit to rapidsai/cuml that referenced this issue Aug 22, 2024
We want to be able to control whether or not the wheel uses the CUDA wheels. Add a `use_cuda_wheels` matrix entry to control this.

Contributes to rapidsai/build-planning#35

Authors:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

Approvers:
  - James Lamb (https://github.com/jameslamb)

URL: #6038
rapids-bot bot pushed a commit to rapidsai/raft that referenced this issue Aug 22, 2024
Use CUDA math wheels to reduce wheel size by not statically linking CUDA math libraries.

Contributes to rapidsai/build-planning#35

Authors:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Bradley Dice (https://github.com/bdice)
  - James Lamb (https://github.com/jameslamb)

URL: #2415
rapids-bot bot pushed a commit to rapidsai/cugraph that referenced this issue Aug 23, 2024
Use CUDA math wheels to reduce wheel size by not statically linking CUDA math libraries.

Contributes to rapidsai/build-planning#35

Authors:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

Approvers:
  - James Lamb (https://github.com/jameslamb)
  - Robert Maynard (https://github.com/robertmaynard)
  - Chuck Hastings (https://github.com/ChuckHastings)
  - Bradley Dice (https://github.com/bdice)

URL: #4621

bdice commented Sep 5, 2024

We’ve mostly explored this for CUDA math library components thus far but we should do the same for nvcomp: https://pypi.org/project/nvidia-nvcomp-cu12/

RAPIDS should shift to using nvcomp wheels as a dependency of our own wheel builds of cudf and kvikio so we do not redistribute nvcomp libraries as part of the cudf and kvikio wheels.

@sisodia1701

We are using them at runtime, and will follow up with the CUDA team as issues occur.


vyasr commented Oct 22, 2024

In last week's meeting we decided to hold off on calling this done because we weren't sure about the state of cugraph-ops. Based on some offline discussions on that front I think that we can close this issue. @KyleFromNVIDIA please reopen if you think that I'm missing something.
