
Ensuring CTK minor version compatibility for cccl.c.parallel #4851

Merged
oleksandr-pavlyk merged 1 commit into NVIDIA:main from oleksandr-pavlyk:feature/c-parallel/ctk-minor-version-compatibility
May 30, 2025

Conversation

@oleksandr-pavlyk
Contributor

Description

closes gh-4845

The `<nvJitLink.h>` header file provides unversioned inline functions that forward to versioned symbols. For example:

```
static inline nvJitLinkResult nvJitLinkCreate(
  nvJitLinkHandle *handle,
  uint32_t numOptions,
  const char **options)
{
  return __nvJitLinkCreate_12_8 (handle, numOptions, options);
}
```

c/parallel calls the unversioned functions, but due to the inlining, the object file of each TU in c/parallel contains versioned symbols, and hence the final shared library depends on the specific CTK minor version it was built with.

nvJitLink.so.12 also exports the unversioned symbols, which the dynamic linker maps to versioned symbols at run time:

```
(nvbench) opavlyk@ee09c48-lcedt:~/repos/cccl$ nm -D /usr/local/cuda/lib64/libnvJitLink.so.12 | grep nvJitLinkCreate
00000000004ba560 T nvJitLinkCreate@@libnvJitLink.so.12
00000000004ba660 T __nvJitLinkCreate_12_0@@libnvJitLink.so.12
00000000004ba670 T __nvJitLinkCreate_12_1@@libnvJitLink.so.12
00000000004ba680 T __nvJitLinkCreate_12_2@@libnvJitLink.so.12
00000000004ba690 T __nvJitLinkCreate_12_3@@libnvJitLink.so.12
00000000004ba6a0 T __nvJitLinkCreate_12_4@@libnvJitLink.so.12
00000000004ba6b0 T __nvJitLinkCreate_12_5@@libnvJitLink.so.12
00000000004ba6c0 T __nvJitLinkCreate_12_6@@libnvJitLink.so.12
00000000004ba6d0 T __nvJitLinkCreate_12_7@@libnvJitLink.so.12
00000000004ba6e0 T __nvJitLinkCreate_12_8@@libnvJitLink.so.12
```

This change replaces direct uses of `#include <nvJitLink.h>` with `#include <nvrtc/nvjitlink_helper.h>`, which defines `NVJITLINK_NO_INLINE` before including `<nvJitLink.h>` (thanks for the idea, @leofang) and then simply declares the unversioned symbols as `extern "C"`.

Linking then resolves these calls to the dynamic unversioned symbols provided by the nvJitLink.so.12 shared library, guaranteeing CTK minor version compatibility.

I verified that gh-4845 is resolved by this change by installing the cuda-parallel wheel from this PR into an environment where torch built with CTK 12.8 was already installed.

```
(pathfinder-trouble) opavlyk@ee09c48-lcedt:~$ python
Python 3.12.10 | packaged by conda-forge | (main, Apr 10 2025, 22:21:13) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> import torch
>>> import cuda.parallel.experimental.algorithms as algorithms
>>>
>>> import ctypes
>>>
>>> lib = ctypes.cdll.LoadLibrary("libnvJitLink.so.12")
>>> lib
<CDLL 'libnvJitLink.so.12', handle 60a782c77790 at 0x76ced9bf9640>
>>> fn = lib.nvJitLinkVersion
>>> fn.restype = ctypes.c_int
>>> fn.argtypes = [ctypes.POINTER(ctypes.c_int), ctypes.POINTER(ctypes.c_int)]
>>> maj = ctypes.c_int(0)
>>> min = ctypes.c_int(0)
>>>
>>>
>>> fn(maj, min)
0
>>> maj, min
(c_int(12), c_int(8))
>>> quit()
```
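The ad-hoc session above can be wrapped in a small helper for reuse. This is a sketch of my own (the function name and error handling are not from the PR); `nvJitLinkVersion` itself is the real nvJitLink entry point exercised in the transcript, returning an `nvJitLinkResult` status with 0 meaning success.

```python
import ctypes


def nvjitlink_version(libname="libnvJitLink.so.12"):
    """Return (major, minor) reported by nvJitLinkVersion, or None if the
    library cannot be loaded or the call fails."""
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        return None
    fn = lib.nvJitLinkVersion
    fn.restype = ctypes.c_int  # nvJitLinkResult; 0 means success
    fn.argtypes = [ctypes.POINTER(ctypes.c_uint), ctypes.POINTER(ctypes.c_uint)]
    major, minor = ctypes.c_uint(0), ctypes.c_uint(0)
    if fn(ctypes.byref(major), ctypes.byref(minor)) != 0:
        return None
    return (major.value, minor.value)


# Degrades gracefully on machines without the library (returns None).
print(nvjitlink_version("libnvJitLink.so.12"))
```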

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@oleksandr-pavlyk oleksandr-pavlyk requested a review from a team as a code owner May 29, 2025 18:49
@oleksandr-pavlyk oleksandr-pavlyk added cuda.compute For all items related to the cuda.parallel Python module c For all items related to the CCCL-C library labels May 29, 2025
@github-project-automation github-project-automation bot moved this to Todo in CCCL May 29, 2025
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL May 29, 2025
@github-actions
Contributor

🟩 CI finished in 36m 54s: Pass: 100%/14 | Total: 2h 09m | Avg: 9m 13s | Max: 21m 06s | Hits: 96%/328
  • 🟩 python: Pass: 100%/12 | Total: 1h 51m | Avg: 9m 16s | Max: 21m 06s

    🟩 cpu
      🟩 amd64              Pass: 100%/12  | Total:  1h 51m | Avg:  9m 16s | Max: 21m 06s
    🟩 ctk
      🟩 12.9               Pass: 100%/12  | Total:  1h 51m | Avg:  9m 16s | Max: 21m 06s
    🟩 cudacxx
      🟩 nvcc12.9           Pass: 100%/12  | Total:  1h 51m | Avg:  9m 16s | Max: 21m 06s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/12  | Total:  1h 51m | Avg:  9m 16s | Max: 21m 06s
    🟩 cxx
      🟩 GCC13              Pass: 100%/12  | Total:  1h 51m | Avg:  9m 16s | Max: 21m 06s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/12  | Total:  1h 51m | Avg:  9m 16s | Max: 21m 06s
    🟩 gpu
      🟩 rtxa6000           Pass: 100%/12  | Total:  1h 51m | Avg:  9m 16s | Max: 21m 06s
    🟩 jobs
      🟩 Build cuda.cccl    Pass: 100%/2   | Total:  7m 02s | Avg:  3m 31s | Max:  3m 32s
      🟩 Build cuda.cooperative Pass: 100%/2   | Total:  6m 45s | Avg:  3m 22s | Max:  3m 26s
      🟩 Build cuda.parallel Pass: 100%/2   | Total: 16m 25s | Avg:  8m 12s | Max:  8m 16s
      🟩 Test cuda.cccl     Pass: 100%/2   | Total:  9m 42s | Avg:  4m 51s | Max:  5m 20s
      🟩 Test cuda.cooperative Pass: 100%/2   | Total: 39m 18s | Avg: 19m 39s | Max: 21m 06s
      🟩 Test cuda.parallel Pass: 100%/2   | Total: 32m 10s | Avg: 16m 05s | Max: 16m 25s
    🟩 py_version
      🟩 3.10               Pass: 100%/6   | Total: 57m 58s | Avg:  9m 39s | Max: 21m 06s
      🟩 3.13               Pass: 100%/6   | Total: 53m 24s | Avg:  8m 54s | Max: 18m 12s
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 17m 48s | Avg: 8m 54s | Max: 14m 48s | Hits: 96%/328

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 17m 48s | Avg:  8m 54s | Max: 14m 48s | Hits:  96%/328   
    🟩 ctk
      🟩 12.9               Pass: 100%/2   | Total: 17m 48s | Avg:  8m 54s | Max: 14m 48s | Hits:  96%/328   
    🟩 cudacxx
      🟩 nvcc12.9           Pass: 100%/2   | Total: 17m 48s | Avg:  8m 54s | Max: 14m 48s | Hits:  96%/328   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 17m 48s | Avg:  8m 54s | Max: 14m 48s | Hits:  96%/328   
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 17m 48s | Avg:  8m 54s | Max: 14m 48s | Hits:  96%/328   
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 17m 48s | Avg:  8m 54s | Max: 14m 48s | Hits:  96%/328   
    🟩 gpu
      🟩 rtx2080            Pass: 100%/2   | Total: 17m 48s | Avg:  8m 54s | Max: 14m 48s | Hits:  96%/328   
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  3m 00s | Avg:  3m 00s | Max:  3m 00s | Hits:  93%/164   
      🟩 Test               Pass: 100%/1   | Total: 14m 48s | Avg: 14m 48s | Max: 14m 48s | Hits:  98%/164   
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
stdpar
python
+/- CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
stdpar
+/- python
+/- CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 14)

# Runner
7 linux-amd64-cpu16
6 linux-amd64-gpu-rtxa6000-latest-1
1 linux-amd64-gpu-rtx2080-latest-1

Contributor

@rwgk rwgk left a comment


Looks great to me, I'm really glad that we found this simple solution.

My suggestions are minor and optional.


```
#define NVJITLINK_NO_INLINE
#include <nvJitLink.h>
#undef NVJITLINK_NO_INLINE
```
Contributor


This #undef surprises me slightly. I'd lean towards keeping it defined:

  • Assuming nvJitLink.h is the only code from where this define is referenced: it does not matter.

  • Assuming other Nvidia code also references this define: undefining here could lead to inconsistencies.

```
nvJitLinkResult nvJitLinkGetErrorLog(nvJitLinkHandle, char*);
nvJitLinkResult nvJitLinkGetInfoLogSize(nvJitLinkHandle, size_t*);
nvJitLinkResult nvJitLinkGetInfoLog(nvJitLinkHandle, char*);
}
```
Contributor


I looked at what's actually being used (below): 10 of 13 APIs.

I'd probably remove the 3 we're not using, with a comment:

// nvJitLink APIs used in cccl/c/parallel

```
      2 src/nvrtc/command_list.h:  nvJitLinkAddData
      1 src/nvrtc/command_list.h:  nvJitLinkComplete
      1 src/nvrtc/command_list.h:  nvJitLinkCreate
      1 src/nvrtc/command_list.h:  nvJitLinkDestroy
      1 src/nvrtc/command_list.h:  nvJitLinkGetErrorLog
      1 src/nvrtc/command_list.h:  nvJitLinkGetErrorLogSize
      1 src/nvrtc/command_list.h:  nvJitLinkGetLinkedCubin
      1 src/nvrtc/command_list.h:  nvJitLinkGetLinkedCubinSize
      1 src/nvrtc/command_list.h:  nvJitLinkGetLinkedPtx
      1 src/nvrtc/command_list.h:  nvJitLinkGetLinkedPtxSize
```


Labels

c For all items related to the CCCL-C library cuda.compute For all items related to the cuda.parallel Python module

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

[BUG]: ImportError importing cuda.parallel in environment where torch 2.7.0 is installed

4 participants