JIT compile interleaved_scan_kernel for CUDA 13 #1405
Merged
rapids-bot[bot] merged 105 commits into rapidsai:main on Feb 14, 2026
KyleFromNVIDIA
requested changes
Oct 6, 2025
Benchmark:
@tfeher as requested, I ran the JITed kernels on a small batch size with a warmup time of 4s. The results are practically the same.
…uvs into jit-lto-ivf-flat-interleaved
lowener
reviewed
Feb 4, 2026
…uvs into jit-lto-ivf-flat-interleaved
dantegd
requested changes
Feb 11, 2026
Mainly one bug in the norm computation, and a couple other minor comments/questions
cpp/src/neighbors/ivf_flat/jit_lto_kernels/ivf_flat_interleaved_scan_kernel.cuh
Outdated
dantegd
approved these changes
Feb 12, 2026
lowener
approved these changes
Feb 13, 2026
robertmaynard
approved these changes
Feb 13, 2026
/merge
This was referenced Feb 26, 2026
rapids-bot bot pushed a commit that referenced this pull request Mar 19, 2026

…wheels against mix of CTK versions (#1862)

The changes from #1405 introduced linking against nvJitLink. nvJitLink has versioned symbols that are added in each new CTK release, and some of those are exposed in `libcuvs.so`. `libcuvs` wheels are built against the latest CTK supported in RAPIDS (CUDA 13.1.1 as of this writing), so when those wheels are used in environments with older nvJitLink, runtime errors like this can happen:

> libcugraph.so: undefined symbol: __nvJitLinkGetErrorLog_13_1, version libnvJitLink.so.13

For more details, see rapidsai/cugraph#5443

This tries to fix that.

Contributes to rapidsai/build-planning#257

* builds CUDA 13 wheels with the 13.0 CTK
* ensures CUDA 13 wheels ship with a runtime dependency of `nvidia-nvjitlink>={whatever-minor-version-they-were-built-against}`

Contributes to rapidsai/build-planning#256

* updates wheel tests to cover a range of CTK versions (we previously, accidentally, were only testing the latest 12.x and 13.x)

Other changes

* ensures conda packages also take on floors of `libnvjitlink>={whatever-minor-version-they-were-built-against}`

## Notes for Reviewers

### How I tested this

This uses wheels from similar PRs from RAPIDS dependencies, at build and test time:

* rapidsai/raft#2971
* rapidsai/rmm#2270
* rapidsai/ucxx#604

### Other Options

1. avoiding those versioned symbols with a build-time shim (#1855 does this, but hasn't been successful yet)
2. statically linking libnvJitLink (hasn't been successful yet)

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- Gil Forsyth (https://github.com/gforsyth)

URL: #1862
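To make the "floor" idea in the commit message above concrete, here is a minimal Python sketch of how a `>=major.minor` runtime floor could be derived from the version a wheel was built against, and how an installed version compares to it. The helper names (`parse_version`, `runtime_floor`, `satisfies_floor`) and the version strings in the usage note are hypothetical, chosen only for illustration; this is not the logic used in #1862 or by any packaging tool.

```python
import re


def parse_version(text: str) -> tuple[int, ...]:
    """Parse a dotted numeric version such as '13.0.88' into (13, 0, 88)."""
    if not re.fullmatch(r"\d+(\.\d+)+", text):
        raise ValueError(f"expected a dotted numeric version, got {text!r}")
    return tuple(int(part) for part in text.split("."))


def runtime_floor(package: str, built_against: str) -> str:
    """Build a requirement string pinned to the build-time major.minor."""
    major, minor = parse_version(built_against)[:2]
    return f"{package}>={major}.{minor}"


def satisfies_floor(installed: str, built_against: str) -> bool:
    """True when the installed major.minor is at least the build-time one."""
    return parse_version(installed)[:2] >= parse_version(built_against)[:2]
```

With made-up version strings, `runtime_floor("nvidia-nvjitlink", "13.0.88")` yields `nvidia-nvjitlink>=13.0`, and `satisfies_floor("13.0.39", "13.1.80")` is false, which corresponds to the situation where a wheel built against a newer minor release meets an older nvJitLink at runtime.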


Closes #1520
This PR introduces infrastructure to generate LTO-IR representations of kernels at compile time. At runtime, these LTO-IR representations are JIT-compiled to SASS for the architecture of the native GPU, with LTO (link-time optimization) of the device functions called within the kernel.
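As a rough illustration of "the architecture of the native GPU": nvJitLink is told which architecture to target through an `-arch=sm_<N>` option, and `-lto` requests link-time optimization. The sketch below only shows how a device's compute capability maps onto those option strings; the function name `nvjitlink_options` is hypothetical, and the snippet does not call nvJitLink or reflect how this PR assembles its options.

```python
def nvjitlink_options(major: int, minor: int) -> list[str]:
    """Format nvJitLink-style options for a compute capability (major, minor).

    Illustrative only: this builds the option strings and nothing else.
    """
    if major < 1 or not 0 <= minor <= 9:
        raise ValueError(f"unexpected compute capability {major}.{minor}")
    # Compute capability 9.0 becomes sm_90, 12.0 becomes sm_120.
    return [f"-arch=sm_{major}{minor}", "-lto"]
```

For example, a device reporting compute capability 9.0 would give `["-arch=sm_90", "-lto"]`.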
The changes in this PR can be divided into 4 parts:

* […] `nvJitLink` as a CUDA 13 dependency
* […] `interleaved_scan_kernel`, where we use a Python script to generate the files that are used first for compiling and then for embedding fatbins

Binary sizes of `interleaved_scan_*.cu` TUs in CUDA 13:

* Total size of aforementioned TUs: 50.293 MB
* Total size of `libcuvs.so` on 27/10/2025 in main: 468.79 MB
* Total size of `libcuvs.so` in this PR: 444.70 MB
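The description mentions a Python script that generates the files to be compiled and embedded. A minimal sketch of that general technique, one generated source file per parameter combination, might look like the following. The parameter lists, the file-naming scheme, and the file contents are all assumptions made for illustration; they are not taken from the script in this PR.

```python
from itertools import product

# Hypothetical parameter axes; the real script's axes are not shown on this page.
DATA_TYPES = ["float", "half"]
VEC_LENS = [1, 4]


def generate_sources(data_types: list[str], vec_lens: list[int]) -> dict[str, str]:
    """Return a mapping of generated file name to file contents.

    Each entry stands for one translation unit that a build system could
    compile separately and later embed.
    """
    sources: dict[str, str] = {}
    for dtype, veclen in product(data_types, vec_lens):
        name = f"interleaved_scan_{dtype}_veclen{veclen}.cu"
        sources[name] = (
            "// Generated file (illustrative sketch); do not edit by hand.\n"
            f"// Instantiation parameters: T={dtype}, veclen={veclen}\n"
        )
    return sources
```

Calling `generate_sources(DATA_TYPES, VEC_LENS)` returns four entries, one per combination, which a build step could then write to disk. Keeping generation as a pure function that returns a mapping makes the set of emitted files easy to inspect before anything is written.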