
Ensuring CTK minor version compatibility for cccl.c.parallel #4851

Merged
oleksandr-pavlyk merged 1 commit into NVIDIA:main from oleksandr-pavlyk:feature/c-parallel/ctk-minor-version-compatibility
May 30, 2025

Conversation

@oleksandr-pavlyk
Contributor

Description

closes gh-4845

The `<nvJitLink.h>` header file provides unversioned inline functions that forward to versioned symbols. For example:

```
static inline nvJitLinkResult nvJitLinkCreate(
  nvJitLinkHandle *handle,
  uint32_t numOptions,
  const char **options)
{
  return __nvJitLinkCreate_12_8 (handle, numOptions, options);
}
```

c/parallel calls the unversioned functions, but due to the inlining, the object file of each TU in c/parallel contains versioned symbols, and hence the final shared library depends on the specific CTK minor version it was built with.

nvJitLink.so.12 also exports the unversioned symbols, which the dynamic linker maps to versioned symbols at run time:

```
(nvbench) opavlyk@ee09c48-lcedt:~/repos/cccl$ nm -D /usr/local/cuda/lib64/libnvJitLink.so.12 | grep nvJitLinkCreate
00000000004ba560 T nvJitLinkCreate@@libnvJitLink.so.12
00000000004ba660 T __nvJitLinkCreate_12_0@@libnvJitLink.so.12
00000000004ba670 T __nvJitLinkCreate_12_1@@libnvJitLink.so.12
00000000004ba680 T __nvJitLinkCreate_12_2@@libnvJitLink.so.12
00000000004ba690 T __nvJitLinkCreate_12_3@@libnvJitLink.so.12
00000000004ba6a0 T __nvJitLinkCreate_12_4@@libnvJitLink.so.12
00000000004ba6b0 T __nvJitLinkCreate_12_5@@libnvJitLink.so.12
00000000004ba6c0 T __nvJitLinkCreate_12_6@@libnvJitLink.so.12
00000000004ba6d0 T __nvJitLinkCreate_12_7@@libnvJitLink.so.12
00000000004ba6e0 T __nvJitLinkCreate_12_8@@libnvJitLink.so.12
```

This change replaces direct uses of `#include <nvJitLink.h>` with `#include <nvrtc/nvjitlink_helper.h>`, which defines `NVJITLINK_NO_INLINE` before including `<nvJitLink.h>` (thanks for the idea, @leofang) and then simply declares the unversioned symbols as `extern "C"`.

Linking then resolves these calls to the dynamic unversioned symbols provided by the nvJitLink.so.12 shared library, guaranteeing CTK minor version compatibility.

I verified that gh-4845 is resolved by this change by installing the cuda-parallel wheel from this PR into an environment where torch built with CTK 12.8 was already installed.

```
(pathfinder-trouble) opavlyk@ee09c48-lcedt:~$ python
Python 3.12.10 | packaged by conda-forge | (main, Apr 10 2025, 22:21:13) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> import torch
>>> import cuda.parallel.experimental.algorithms as algorithms
>>>
>>> import ctypes
>>>
>>> lib = ctypes.cdll.LoadLibrary("libnvJitLink.so.12")
>>> lib
<CDLL 'libnvJitLink.so.12', handle 60a782c77790 at 0x76ced9bf9640>
>>> fn = lib.nvJitLinkVersion
>>> fn.restype = ctypes.c_int
>>> fn.argtypes = [ctypes.POINTER(ctypes.c_int), ctypes.POINTER(ctypes.c_int)]
>>> maj = ctypes.c_int(0)
>>> min = ctypes.c_int(0)
>>>
>>>
>>> fn(maj, min)
0
>>> maj, min
(c_int(12), c_int(8))
>>> quit()
```
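The ad-hoc session above can be wrapped in a small helper for reuse. This is a sketch of my own (the function name and error handling are not from the PR); `nvJitLinkVersion` itself is the real nvJitLink entry point exercised in the transcript, returning an `nvJitLinkResult` status with 0 meaning success.

```python
import ctypes


def nvjitlink_version(libname="libnvJitLink.so.12"):
    """Return (major, minor) reported by nvJitLinkVersion, or None if the
    library cannot be loaded or the call fails."""
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        return None
    fn = lib.nvJitLinkVersion
    fn.restype = ctypes.c_int  # nvJitLinkResult; 0 means success
    fn.argtypes = [ctypes.POINTER(ctypes.c_uint), ctypes.POINTER(ctypes.c_uint)]
    major, minor = ctypes.c_uint(0), ctypes.c_uint(0)
    if fn(ctypes.byref(major), ctypes.byref(minor)) != 0:
        return None
    return (major.value, minor.value)


# Degrades gracefully on machines without the library (returns None).
print(nvjitlink_version("libnvJitLink.so.12"))
```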

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@oleksandr-pavlyk oleksandr-pavlyk requested a review from a team as a code owner May 29, 2025 18:49
@oleksandr-pavlyk oleksandr-pavlyk added cuda.compute For all items related to the cuda.parallel Python module c For all items related to the CCCL-C library labels May 29, 2025
@github-project-automation github-project-automation bot moved this to Todo in CCCL May 29, 2025
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL May 29, 2025
@github-actions
Contributor

🟩 CI finished in 36m 54s: Pass: 100%/14 | Total: 2h 09m | Avg: 9m 13s | Max: 21m 06s | Hits: 96%/328
  • 🟩 python: Pass: 100%/12 | Total: 1h 51m | Avg: 9m 16s | Max: 21m 06s

    🟩 cpu
      🟩 amd64              Pass: 100%/12  | Total:  1h 51m | Avg:  9m 16s | Max: 21m 06s
    🟩 ctk
      🟩 12.9               Pass: 100%/12  | Total:  1h 51m | Avg:  9m 16s | Max: 21m 06s
    🟩 cudacxx
      🟩 nvcc12.9           Pass: 100%/12  | Total:  1h 51m | Avg:  9m 16s | Max: 21m 06s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/12  | Total:  1h 51m | Avg:  9m 16s | Max: 21m 06s
    🟩 cxx
      🟩 GCC13              Pass: 100%/12  | Total:  1h 51m | Avg:  9m 16s | Max: 21m 06s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/12  | Total:  1h 51m | Avg:  9m 16s | Max: 21m 06s
    🟩 gpu
      🟩 rtxa6000           Pass: 100%/12  | Total:  1h 51m | Avg:  9m 16s | Max: 21m 06s
    🟩 jobs
      🟩 Build cuda.cccl    Pass: 100%/2   | Total:  7m 02s | Avg:  3m 31s | Max:  3m 32s
      🟩 Build cuda.cooperative Pass: 100%/2   | Total:  6m 45s | Avg:  3m 22s | Max:  3m 26s
      🟩 Build cuda.parallel Pass: 100%/2   | Total: 16m 25s | Avg:  8m 12s | Max:  8m 16s
      🟩 Test cuda.cccl     Pass: 100%/2   | Total:  9m 42s | Avg:  4m 51s | Max:  5m 20s
      🟩 Test cuda.cooperative Pass: 100%/2   | Total: 39m 18s | Avg: 19m 39s | Max: 21m 06s
      🟩 Test cuda.parallel Pass: 100%/2   | Total: 32m 10s | Avg: 16m 05s | Max: 16m 25s
    🟩 py_version
      🟩 3.10               Pass: 100%/6   | Total: 57m 58s | Avg:  9m 39s | Max: 21m 06s
      🟩 3.13               Pass: 100%/6   | Total: 53m 24s | Avg:  8m 54s | Max: 18m 12s
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 17m 48s | Avg: 8m 54s | Max: 14m 48s | Hits: 96%/328

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 17m 48s | Avg:  8m 54s | Max: 14m 48s | Hits:  96%/328   
    🟩 ctk
      🟩 12.9               Pass: 100%/2   | Total: 17m 48s | Avg:  8m 54s | Max: 14m 48s | Hits:  96%/328   
    🟩 cudacxx
      🟩 nvcc12.9           Pass: 100%/2   | Total: 17m 48s | Avg:  8m 54s | Max: 14m 48s | Hits:  96%/328   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 17m 48s | Avg:  8m 54s | Max: 14m 48s | Hits:  96%/328   
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 17m 48s | Avg:  8m 54s | Max: 14m 48s | Hits:  96%/328   
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 17m 48s | Avg:  8m 54s | Max: 14m 48s | Hits:  96%/328   
    🟩 gpu
      🟩 rtx2080            Pass: 100%/2   | Total: 17m 48s | Avg:  8m 54s | Max: 14m 48s | Hits:  96%/328   
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  3m 00s | Avg:  3m 00s | Max:  3m 00s | Hits:  93%/164   
      🟩 Test               Pass: 100%/1   | Total: 14m 48s | Avg: 14m 48s | Max: 14m 48s | Hits:  98%/164   
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
stdpar
python
+/- CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
stdpar
+/- python
+/- CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 14)

# Runner
7 linux-amd64-cpu16
6 linux-amd64-gpu-rtxa6000-latest-1
1 linux-amd64-gpu-rtx2080-latest-1

Contributor

@rwgk rwgk left a comment


Looks great to me, I'm really glad that we found this simple solution.

My suggestions are minor and optional.


```
#define NVJITLINK_NO_INLINE
#include <nvJitLink.h>
#undef NVJITLINK_NO_INLINE
```
Contributor


This #undef surprises me slightly. I'd lean towards keeping it defined:

  • Assuming nvJitLink.h is the only code from where this define is referenced: it does not matter.

  • Assuming other Nvidia code also references this define: undefining here could lead to inconsistencies.

```
nvJitLinkResult nvJitLinkGetErrorLog(nvJitLinkHandle, char*);
nvJitLinkResult nvJitLinkGetInfoLogSize(nvJitLinkHandle, size_t*);
nvJitLinkResult nvJitLinkGetInfoLog(nvJitLinkHandle, char*);
}
```
Contributor


I looked at what's actually being used (below): 10 of 13 APIs.

I'd probably remove the 3 we're not using, with a comment:

// nvJitLink APIs used in cccl/c/parallel

```
      2 src/nvrtc/command_list.h:  nvJitLinkAddData
      1 src/nvrtc/command_list.h:  nvJitLinkComplete
      1 src/nvrtc/command_list.h:  nvJitLinkCreate
      1 src/nvrtc/command_list.h:  nvJitLinkDestroy
      1 src/nvrtc/command_list.h:  nvJitLinkGetErrorLog
      1 src/nvrtc/command_list.h:  nvJitLinkGetErrorLogSize
      1 src/nvrtc/command_list.h:  nvJitLinkGetLinkedCubin
      1 src/nvrtc/command_list.h:  nvJitLinkGetLinkedCubinSize
      1 src/nvrtc/command_list.h:  nvJitLinkGetLinkedPtx
      1 src/nvrtc/command_list.h:  nvJitLinkGetLinkedPtxSize
```


Labels

c For all items related to the CCCL-C library cuda.compute For all items related to the cuda.parallel Python module

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

[BUG]: ImportError importing cuda.parallel in environment where torch 2.7.0 is installed

4 participants