This repository was archived by the owner on Sep 18, 2025. It is now read-only.

Build and test with CUDA 13.0.0 #1162

Merged
rapids-bot[bot] merged 9 commits into rapidsai:branch-0.46 from jameslamb:cuda-13.0.0 on Aug 21, 2025

Conversation

@jameslamb (Member) commented Aug 20, 2025

Contributes to rapidsai/build-planning#208

  • uses CUDA 13.0.0 to build and test
  • moves some dependency pins:
    • cupy: >=13.6.0

Contributes to rapidsai/build-planning#68

  • updates the fallback entries in dependencies.yaml matrices (i.e., the ones that get written to pyproject.toml in source control) to CUDA 13 dependencies

Notes for Reviewers

This switches GitHub Actions workflows to the cuda13.0 branch from here: rapidsai/shared-workflows#413

A future round of PRs will revert that back to branch-25.10, once all of RAPIDS supports CUDA 13.

copy-pr-bot bot commented Aug 20, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@jameslamb (Member Author)

/ok to test

@jameslamb (Member Author)

OK, two things we need to resolve.

Problem 1: building wheels against libucx-cu13==1.19.0 headers fails

  Compiling ucp/_libs/ucx_api.pyx because it changed.
  [1/1] Cythonizing ucp/_libs/ucx_api.pyx
  building 'ucp._libs.ucx_api' extension
  creating build/temp.linux-aarch64-cpython-313/ucp/_libs/src
  gcc -pthread -fno-strict-overflow -Wsign-compare -DNDEBUG -g -O3 -Wall -fPIC -I/pyenv/versions/3.13.6/include -I/tmp/pip-build-env-jzi5ae7w/normal/lib/python3.13/site-packages/libucx/include -I/pyenv/versions/3.13.6/include/python3.13 -c ucp/_libs/src/c_util.c -o build/temp.linux-aarch64-cpython-313/ucp/_libs/src/c_util.o -std=c99 -Werror
  In file included from /tmp/pip-build-env-jzi5ae7w/normal/lib/python3.13/site-packages/libucx/include/ucs/memory/memory_type.h:12,
                   from /tmp/pip-build-env-jzi5ae7w/normal/lib/python3.13/site-packages/libucx/include/ucp/api/ucp_def.h:14,
                   from /tmp/pip-build-env-jzi5ae7w/normal/lib/python3.13/site-packages/libucx/include/ucp/api/ucp.h:12,
                   from ucp/_libs/src/c_util.h:8,
                   from ucp/_libs/src/c_util.c:6:
  /tmp/pip-build-env-jzi5ae7w/normal/lib/python3.13/site-packages/libucx/include/ucs/sys/compiler_def.h:81:50: error: operand of ‘?:’ changes signedness from ‘int’ to ‘long unsigned int’ due to unsignedness of other operand [-Werror=sign-compare]
     81 | #define UCS_MASK(_i)             (((_i) >= 64) ? ~0 : (UCS_BIT(_i) - 1))
        |                                                  ^~
  /tmp/pip-build-env-jzi5ae7w/normal/lib/python3.13/site-packages/libucx/include/ucp/api/ucp.h:511:31: note: in expansion of macro ‘UCS_MASK’
    511 |     UCP_DATATYPE_CLASS_MASK = UCS_MASK(UCP_DATATYPE_SHIFT) /**< Data-type class
        |                               ^~~~~~~~
  cc1: all warnings being treated as errors
  error: command '/opt/rh/gcc-toolset-14/root/usr/bin/gcc' failed with exit code 1
  error: subprocess-exited-with-error
  
  × Building wheel for ucx-py-cu13 (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

(wheel-build link)
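
The failing flag combination suggests one workaround: keep -Werror overall, but downgrade just this one diagnostic back to a plain warning. A minimal, hypothetical setuptools/Cython sketch of that idea (this is not this repo's actual build configuration):

# Hypothetical sketch: exempt only the sign-compare diagnostic (raised by
# the UCS_MASK macro in the libucx-cu13 headers) so the build can proceed.
from Cython.Build import cythonize
from setuptools import Extension, setup

ext = Extension(
    "ucp._libs.ucx_api",
    sources=["ucp/_libs/ucx_api.pyx"],
    extra_compile_args=[
        "-std=c99",
        "-Werror",
        # GCC gives the more specific -Wno-error= priority over the
        # blanket -Werror, so only this diagnostic stays non-fatal:
        "-Wno-error=sign-compare",
    ],
)

setup(ext_modules=cythonize([ext]))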

Problem 2: tests require CUDA 13 cudf packages, which don't exist yet

error    libmamba Could not solve for environment specs
    The following packages are incompatible
    ├─ cuda-version =13.0 * is requested and can be installed;
    └─ cudf =25.10,>=0.0.0a0 * is not installable because it requires
       └─ cuda-version >=12,<13.0a0 *, which conflicts with any installable versions previously reported.
critical libmamba Could not solve for environment specs

(conda-python-tests link)

@pentschev (Member)

I think this comes from a place of need, but could we skip CUDA 13 for UCX-Py? The plan is to start archiving it next Monday (see #1160), which I presume will take a week or two at most, so the timing is a little off here. In any case, I want us to explore whether there's a way to move forward with CUDA 13 support without UCX-Py support, since the work here is likely going to be removed in about a week.

@jameslamb (Member Author)

explore if there's a way we can move forward with CUDA 13 support without UCX-Py support

If we're not going to release ucx-py 0.46 packages, we don't technically "need" to support CUDA 13 here.

However, we DO need to get CUDA 13 support shipped in RAPIDS 25.10 (cc @robertmaynard). There are 18 (U.S.) working days left until burndown begins for the 25.10 release: https://docs.rapids.ai/maintainers/

If we wait for ucx-py to be factored out of RAPIDS, and that takes 1-2 weeks and doesn't start until Monday, there wouldn't be much time left to work through these CUDA 13 updates across the rest of RAPIDS and to discover other issues, like dependencies that haven't been updated yet.

I tried to roughly map out the dependencies in rapidsai/build-planning#208

With that plus the list you provided in #1160, I think we're going to get stuck pretty soon without ucx-py CUDA 13 packages... because ucx-py is a dependency (directly or indirectly) of most of RAPIDS.

For example, raft-dask has a hard runtime dependency on ucx-py, and not having CUDA 13 packages for all of the packages from the raft repo blocks cuvs, cuml, cugraph, and cugraph-gnn.

Here are some options I see:

  • get ucx-py CUDA 13 support working as soon as possible
  • try to update the paths through the dependency graph that don't need ucx-py
    • e.g., only C++ builds / tests?
    • e.g., could we update all of RAFT except raft-dask?

Trying to work around ucx-py like that would be a lot more effort than just going through the normal build/packaging updates... but I don't know how much more, and I don't know how to compare it to the amount of effort it'd take to get ucx-py building with CUDA 13.

@jameslamb (Member Author)

I've also started an internal chat thread about this, if you'd rather talk through the options there.

@bdice (Contributor) commented Aug 21, 2025

@jameslamb I recommend skipping/disabling the tests that require cuDF for now.

error    libmamba Could not solve for environment specs
    The following packages are incompatible
    ├─ cuda-version =13.0 * is requested and can be installed;
    └─ cudf =25.10,>=0.0.0a0 * is not installable because it requires
       └─ cuda-version >=12,<13.0a0 *, which conflicts with any installable versions previously reported.
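
A minimal sketch of the kind of skip being recommended here, assuming pytest (the module and test below are hypothetical, not this repo's actual tests):

# Hypothetical sketch: skip an entire test module when cudf cannot be
# imported (e.g., while CUDA 13 cudf packages do not exist yet).
import pytest

cudf = pytest.importorskip("cudf")


def test_roundtrip_cudf_dataframe():
    # Illustrative only; real tests would exercise ucx-py transfers.
    df = cudf.DataFrame({"a": [1, 2, 3]})
    assert len(df) == 3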

@jameslamb (Member Author)

/ok to test

@jameslamb (Member Author)

Just pushed some changes I think will help. By ignoring that one compiler warning from #1162 (comment), I was able to build and run most tests locally like this:

docker run \
    --rm \
    -v $(pwd):/opt/work \
    -w /opt/work \
    -it rapidsai/ci-wheel:25.10-cuda13.0.0-rockylinux8-py3.13 \
    ./ci/build_wheel.sh

# NOTE: with test_wheel.sh changed to install the locally-built wheel from dist/ instead of downloading it
docker run \
    --rm \
    --gpus all \
    -v $(pwd):/opt/work \
    -w /opt/work \
    -it rapidsai/ci-wheel:25.10-cuda13.0.0-rockylinux8-py3.13 \
    ./ci/test_wheel.sh

There were just a few cupy failures, because the driver on my local workstation is too old for CUDA 13. So hopefully those changes are all we need; let's see 😅
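
As an aside, one quick way to confirm a too-old driver is to query the driver-supported CUDA version directly. A hedged sketch using ctypes against libcuda (the 13.0 comparison below is an assumption based on CUDA's MAJOR * 1000 + MINOR * 10 version encoding):

# Hypothetical sketch: ask the CUDA driver which CUDA version it supports,
# to tell driver-too-old failures apart from real test regressions.
import ctypes

libcuda = ctypes.CDLL("libcuda.so.1")
version = ctypes.c_int()
libcuda.cuDriverGetVersion(ctypes.byref(version))

# The driver encodes the version as MAJOR * 1000 + MINOR * 10.
major, minor = version.value // 1000, (version.value % 1000) // 10
print(f"driver supports up to CUDA {major}.{minor}")
if (major, minor) < (13, 0):
    print("too old to run CUDA 13 builds of cupy")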

@jameslamb changed the title from "WIP: Build and test with CUDA 13.0.0" to "Build and test with CUDA 13.0.0" Aug 21, 2025
@jameslamb requested review from bdice and pentschev August 21, 2025 15:16
@jameslamb marked this pull request as ready for review August 21, 2025 15:16
@jameslamb requested review from a team as code owners August 21, 2025 15:16
@jameslamb (Member Author)

At this point everyone involved is subscribed to notifications via the comments anyway, so I've moved this out of draft.

@jameslamb (Member Author)

Some CUDA 13 tests are segfaulting like this:

Current thread 0x000073d0c2768b80 (most recent call first):
  File "/pyenv/versions/3.10.18/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 326 in safe_cuda_api_call
  File "/pyenv/versions/3.10.18/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 505 in __enter__
  File "/pyenv/versions/3.10.18/lib/python3.10/site-packages/numba/cuda/cudadrv/devices.py", line 121 in ensure_context
  File "/pyenv/versions/3.10.18/lib/python3.10/contextlib.py", line 135 in __enter__
  File "/pyenv/versions/3.10.18/lib/python3.10/site-packages/numba/cuda/cudadrv/devices.py", line 231 in _require_cuda_context
...
==== backtrace (tid:    727) ====
 0  /pyenv/versions/3.10.18/lib/python3.10/site-packages/libucx/lib/libucs.so(ucs_handle_error+0x294) [0x73d0c0fb7b14]
 1  /pyenv/versions/3.10.18/lib/python3.10/site-packages/libucx/lib/libucs.so(+0x34cca) [0x73d0c0fb7cca]
 2  /pyenv/versions/3.10.18/lib/python3.10/site-packages/libucx/lib/libucs.so(+0x34f7e) [0x73d0c0fb7f7e]
 3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x73d0c2898330]
 4  /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c) [0x73d0c28f1b2c]
 5  /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e) [0x73d0c289827e]

(build link)

Seeing numba.cuda in the stack trace makes me think... maybe we need to add an explicit numba-cuda dependency here, similar to rapidsai/dask-cuda#1531

Comment on lines +43 to 49
# CUDA 13
conda create -n ucx -c conda-forge -c rapidsai \
cuda-version=13.0 ucx-py

# CUDA 12
conda create -n ucx -c conda-forge -c rapidsai \
cuda-version=12.9 ucx-py
Member

Suggested change
-# CUDA 13
-conda create -n ucx -c conda-forge -c rapidsai \
-cuda-version=13.0 ucx-py
-
-# CUDA 12
-conda create -n ucx -c conda-forge -c rapidsai \
-cuda-version=12.9 ucx-py
+# CUDA 12
+conda create -n ucx -c conda-forge -c rapidsai \
+cuda-version=12.9 ucx-py
+
+# CUDA 13
+conda create -n ucx -c conda-forge -c rapidsai \
+cuda-version=13.0 ucx-py

Member

Just to keep the ordering consistent with the text, and to keep it chronological.

Member Author

@bdice asked me in rapidsai/kvikio#803 (comment) to order these types of things with newer CUDA first.

I'm indifferent about it; I'll let him comment here and do whatever you two agree on.

Member

I'd prefer we keep it consistent with the text; it looks very strange to me to have 13 come before 12. This is how it always looked in the past (11, then 12).

Contributor

We want to shift the default (first command offered) to be the newest. The newest CUDA version will be supported for longer and we want to encourage users to adopt new versions.

Contributor

We can rearrange the text as needed but should prefer 13 over 12 in ordering and in any situations where we only give one example.

Member

I'd prefer that we then rephrase (some of) the text. If we want to encourage that, we could start referring to CUDA 12 as "legacy" or something like that, to provide some strong encouragement.

Member Author

I don't think we should hold up this PR over this phrasing in documentation. Since you've both approved the PR (so approved all the other changes), I'm planning to merge this as soon as CI finishes, to keep moving forward with supporting CUDA 13 in RAPIDS.

I'd be happy to review a follow-up PR changing the text around these install instructions.

Member

Yes, please move forward @jameslamb .

Member Author

Thank you!

Comment on lines +68 to 72
# CUDA 13
pip install ucx-py-cu13

# CUDA 12
pip install ucx-py-cu12
Member

Suggested change
-# CUDA 13
-pip install ucx-py-cu13
-
-# CUDA 12
-pip install ucx-py-cu12
+# CUDA 12
+pip install ucx-py-cu12
+
+# CUDA 13
+pip install ucx-py-cu13

Comment on lines +246 to 249
# CUDA 13
pip install 'libucx-cu13>=1.19.0,<1.20'

# CUDA 12
Member

Suggested change
-# CUDA 13
-pip install 'libucx-cu13>=1.19.0,<1.20'
-
-# CUDA 12
+# CUDA 12
+pip install 'libucx-cu12>=1.19.0,<1.20'
+
+# CUDA 13


# CUDA 12
pip install 'libucx-cu12>=1.16.0,<1.17'
pip install 'libucx-cu12>=1.19.0,<1.20'
Member

Suggested change
-pip install 'libucx-cu12>=1.19.0,<1.20'
+pip install 'libucx-cu13>=1.19.0,<1.20'

@jameslamb requested a review from a team as a code owner August 21, 2025 15:50
@pentschev (Member) left a comment

LGTM, thanks @jameslamb!

@bdice (Contributor) left a comment

One tiny suggestion, otherwise LGTM

Co-authored-by: Bradley Dice <bdice@bradleydice.com>
@jameslamb (Member Author)

/merge

rapids-bot bot merged commit 21d744c into rapidsai:branch-0.46 Aug 21, 2025
53 checks passed
@jameslamb deleted the cuda-13.0.0 branch August 21, 2025 20:09
rapids-bot bot pushed a commit that referenced this pull request Aug 29, 2025
Contributes to rapidsai/build-planning#208

#1162 temporarily removed the `cudf` test-time dependency here, because there weren't yet CUDA 13 `cudf` packages.

Those now exist (rapidsai/cudf#19768), so this restores that dependency.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #1164
