JIT compile interleaved_scan_kernel for CUDA 13 #1405

Merged: rapids-bot[bot] merged 105 commits into rapidsai:main from divyegala:jit-lto-ivf-flat-interleaved on Feb 14, 2026

@divyegala divyegala commented Oct 2, 2025

Closes #1520

This PR introduces infrastructure to generate LTO-IR representations of kernels at compile time. At runtime, these LTO-IR representations are JIT-compiled to SASS for the architecture of the native GPU, with LTO (link-time optimization) of the device functions called within the kernel.
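The runtime half of this flow (link once for the native architecture, then reuse the result) can be sketched as a small cache. This is a minimal illustration in Python, not the cuVS implementation: `KernelCache`, `fake_link`, and the `sm_90` / `sm_80` architecture strings are hypothetical names, and the link step is a placeholder for the real nvJitLink call.

```python
from typing import Callable, Dict, Tuple


class KernelCache:
    """Illustrative cache: link each (kernel, architecture) pair at most once."""

    def __init__(self, link: Callable[[str, str], bytes]) -> None:
        self._link = link  # hypothetical stand-in for the JIT link step
        self._cache: Dict[Tuple[str, str], bytes] = {}
        self.link_calls = 0  # counts how often the expensive step actually ran

    def get(self, kernel_name: str, arch: str) -> bytes:
        key = (kernel_name, arch)
        if key not in self._cache:
            self.link_calls += 1
            self._cache[key] = self._link(kernel_name, arch)
        return self._cache[key]


def fake_link(kernel_name: str, arch: str) -> bytes:
    # Placeholder only: a real implementation would hand LTO-IR to nvJitLink
    # and get back SASS for the requested architecture.
    return f"{kernel_name}@{arch}".encode()


cache = KernelCache(fake_link)
first = cache.get("interleaved_scan_kernel", "sm_90")
second = cache.get("interleaved_scan_kernel", "sm_90")  # served from the cache
```

Keying on both the kernel name and the architecture is an assumption of this sketch; it reflects that the linked binary is specific to the GPU it was built for.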

The changes in this PR can be divided into four parts:

  1. CMake code to generate fatbins to JIT at runtime
  2. Architecture to handle JITing and caching of kernels
  3. Changes to the build system to bring in nvJitLink as a CUDA 13 dependency
  4. Using all of the above to JIT interleaved_scan_kernel, where a Python script generates the files that are used first for compiling and then for embedding the fatbins
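As a rough idea of what a generator script for step 4 might enumerate, the sketch below builds file names from a cross product of type parameters. The PR does not show the script itself, so everything here is an assumption: the type lists and the name pattern are inferred only from the TU names in the size table in this description, and `generated_file_names` is a hypothetical helper.

```python
from itertools import product
from typing import List

# Assumed parameter space, inferred from the TU names listed in this PR.
DATA_TYPES = ["float", "half", "int8_t", "uint8_t"]
INDEX_TYPES = ["int64_t"]
FILTER_SUFFIXES = ["", "_bitset"]  # no filter vs. bitset filter


def generated_file_names() -> List[str]:
    """Return one hypothetical source file name per type combination."""
    return [
        f"ivf_flat_interleaved_scan_{data_t}_{idx_t}{suffix}.cu"
        for data_t, idx_t, suffix in product(DATA_TYPES, INDEX_TYPES, FILTER_SUFFIXES)
    ]


names = generated_file_names()
for name in names:
    print(name)
```

Generating the list from a cross product keeps the set of instantiations in one place, which is the usual motivation for scripting this step rather than maintaining the files by hand.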

Binary sizes of interleaved_scan_*.cu TUs in CUDA 13:

| TU name | Size (MB) |
| --- | --- |
| `ivf_flat_interleaved_scan_half_int64_t.cu.o` | 5.658 |
| `ivf_flat_interleaved_scan_half_int64_t_bitset.cu.o` | 5.998 |
| `ivf_flat_interleaved_scan_uint8_t_int64_t.cu.o` | 6.402 |
| `ivf_flat_interleaved_scan_int8_t_int64_t.cu.o` | 6.338 |
| `ivf_flat_interleaved_scan_float_int64_t_bitset.cu.o` | 6.273 |
| `ivf_flat_interleaved_scan_float_int64_t.cu.o` | 5.872 |
| `ivf_flat_interleaved_scan_uint8_t_int64_t_bitset.cu.o` | 6.851 |
| `ivf_flat_interleaved_scan_int8_t_int64_t_bitset.cu.o` | 6.901 |

Total size of aforementioned TUs: 50.293 MB
Total size of libcuvs.so on 27/10/2025 in main: 468.79 MB
Total size of libcuvs.so in this PR: 444.70 MB
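The totals above can be re-derived with a few lines of arithmetic. The sketch below uses only the figures quoted in this description; the dictionary keys are shorthand labels chosen for this example, not file names.

```python
# Sizes in MB, copied from the table in this PR description.
tu_sizes_mb = {
    "half": 5.658,
    "half_bitset": 5.998,
    "uint8_t": 6.402,
    "int8_t": 6.338,
    "float_bitset": 6.273,
    "float": 5.872,
    "uint8_t_bitset": 6.851,
    "int8_t_bitset": 6.901,
}

total_tu_mb = round(sum(tu_sizes_mb.values()), 3)

size_main_mb = 468.79  # libcuvs.so in main
size_pr_mb = 444.70    # libcuvs.so in this PR

saved_mb = round(size_main_mb - size_pr_mb, 2)
saved_pct = round(100 * saved_mb / size_main_mb, 2)

print(f"TU total: {total_tu_mb} MB")
print(f"libcuvs.so reduction: {saved_mb} MB ({saved_pct}%)")
```

The per-TU sizes add up to the quoted 50.293 MB, and the library shrinks by 24.09 MB, which is roughly 5% of its size in main.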


copy-pr-bot bot commented Oct 2, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@divyegala divyegala changed the title from "jit lto interleaved scan" to "JIT compile interleaved_scan_kernel and LTO device functions" Oct 7, 2025
@divyegala divyegala changed the title from "JIT compile interleaved_scan_kernel and LTO device functions" to "JIT compile interleaved_scan_kernel with LTO of device functions" Oct 7, 2025
@divyegala divyegala marked this pull request as ready for review October 8, 2025 20:31
@divyegala divyegala requested review from a team as code owners October 8, 2025 20:31
@divyegala divyegala added the "feature request" (New feature or request) and "non-breaking" (Introduces a non-breaking change) labels Oct 8, 2025

divyegala commented Oct 9, 2025

Benchmark:

[image]

@tfeher as requested, I ran the JITed kernels on a small batch size with a warmup time of 4s. The results are practically the same:

[image]

@divyegala divyegala requested a review from a team as a code owner October 9, 2025 02:55
@divyegala divyegala requested review from dantegd and lowener February 4, 2026 21:15
@divyegala divyegala requested a review from lowener February 10, 2026 20:23

@dantegd dantegd left a comment

Mainly one bug in the norm computation, and a couple of other minor comments/questions.

@divyegala

/merge

@rapids-bot rapids-bot bot merged commit da6136e into rapidsai:main Feb 14, 2026
149 of 151 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Unstructured Data Processing Feb 14, 2026
rapids-bot bot pushed a commit that referenced this pull request Mar 19, 2026
…wheels against mix of CTK versions (#1862)

The changes from #1405 introduced linking against nvJitLink. nvJitLink has versioned symbols that are added in each new CTK release, and some of those are exposed in `libcuvs.so`.

`libcuvs` wheels are built against the latest CTK supported in RAPIDS (CUDA 13.1.1 as of this writing), so when those wheels are used in environments with older nvJitLink, runtime errors like this can happen:

> libcugraph.so: undefined symbol: __nvJitLinkGetErrorLog_13_1, version libnvJitLink.so.13
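To make the failure mode concrete, the sketch below reads the version suffix off a symbol name like the one in the error above and compares it with an installed nvJitLink version. The assumption that the trailing `_13_1` encodes a major and minor CTK version is inferred from this message and the description of versioned symbols; `required_version` and `is_satisfied` are hypothetical helpers, not part of any NVIDIA or RAPIDS tool.

```python
import re
from typing import Optional, Tuple

# Assumed naming scheme: __nvJitLink<Name>_<major>_<minor>
_SYMBOL_RE = re.compile(r"^__nvJitLink\w+?_(\d+)_(\d+)$")


def required_version(symbol: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from a versioned symbol name, if present."""
    match = _SYMBOL_RE.match(symbol)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_satisfied(symbol: str, installed: Tuple[int, int]) -> bool:
    """True when the installed nvJitLink is at least as new as the symbol needs."""
    needed = required_version(symbol)
    return needed is None or installed >= needed


symbol = "__nvJitLinkGetErrorLog_13_1"
print(required_version(symbol))
print(is_satisfied(symbol, installed=(13, 0)))
print(is_satisfied(symbol, installed=(13, 1)))
```

Under that assumption, a library built against CTK 13.1 asks for a `_13_1` symbol that a 13.0 nvJitLink does not export, which matches the undefined-symbol error quoted above.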

For more details, see rapidsai/cugraph#5443

This tries to fix that.

Contributes to rapidsai/build-planning#257

* builds CUDA 13 wheels with the 13.0 CTK
* ensures CUDA 13 wheels ship with a runtime dependency of `nvidia-nvjitlink>={whatever-minor-version-they-were-built-against}`
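The dependency floor described in the second bullet can be derived from the CTK version used at build time. The following is a minimal sketch under stated assumptions: `nvjitlink_floor` is a hypothetical helper, and the exact form of the pin in the real packaging (for example, whether it also carries an upper bound) is not shown in this commit message.

```python
def nvjitlink_floor(ctk_version: str, package: str = "nvidia-nvjitlink") -> str:
    """Build a '>=' requirement on the major.minor of the build-time CTK."""
    parts = ctk_version.split(".")
    if len(parts) < 2:
        raise ValueError(f"expected at least major.minor, got {ctk_version!r}")
    major, minor = int(parts[0]), int(parts[1])
    return f"{package}>={major}.{minor}"


# Wheel built against CTK 13.0.x
print(nvjitlink_floor("13.0.2"))
# Same idea with the conda package name mentioned in this commit message
print(nvjitlink_floor("13.1.1", package="libnvjitlink"))
```

Dropping the patch component is a choice made for this sketch, following the "minor version they were built against" wording above.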

Contributes to rapidsai/build-planning#256

* updates wheel tests to cover a range of CTK versions (we previously, accidentally, were only testing the latest 12.x and 13.x)

Other changes

* ensures conda packages also take on floors of `libnvjitlink>={whatever-minor-version-they-were-built-against}`

## Notes for Reviewers

### How I tested this

This uses wheels from similar PRs from RAPIDS dependencies, at build and test time:

* rapidsai/raft#2971
* rapidsai/rmm#2270
* rapidsai/ucxx#604

### Other Options

1. avoiding those versioned symbols with a build-time shim (#1855 does this, but hasn't been successful yet)
2. statically linking libnvJitLink (hasn't been successful yet)

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Gil Forsyth (https://github.com/gforsyth)

URL: #1862
jrbourbeau pushed a commit to jrbourbeau/cuvs that referenced this pull request Mar 25, 2026
…wheels against mix of CTK versions (rapidsai#1862)


Development

Successfully merging this pull request may close these issues.

Introduce JIT-LTO Infrastructure and apply on interleaved_scan_kernel

8 participants