Skip to content

[Dist] Add lazy-loading stubs for CUDART + NVRTC (CUDA 11/12/13 compatible wheels)#1821

Merged
LeiWang1999 merged 5 commits intotile-ai:mainfrom
LeiWang1999:feature/cuda-stubs-cudart-nvrtc
Feb 9, 2026
Merged

[Dist] Add lazy-loading stubs for CUDART + NVRTC (CUDA 11/12/13 compatible wheels)#1821
LeiWang1999 merged 5 commits intotile-ai:mainfrom
LeiWang1999:feature/cuda-stubs-cudart-nvrtc

Conversation

@LeiWang1999
Copy link
Member

@LeiWang1999 LeiWang1999 commented Feb 9, 2026

  • Wheels that link directly against versioned CUDA libraries (libcudart.so., libnvrtc.so.) often break when run in an environment with a different CUDA
    Toolkit major (e.g. build with 12.x, run with 13.x).
    versioned SONAMEs.
    What’s in this PR

  • Add src/target/stubs/cudart.cc: a cudart_stub shared library that exports the CUDA Runtime API subset used by TVM/TileLang and lazily loads the real libcudart via

    • Tries multiple SONAMEs for compatibility: libcudart.so.13, .12, .11.0, .11, then libcudart.so.
  • Add src/target/stubs/nvrtc.cc: an nvrtc_stub shared library that exports the NVRTC C API subset used by TVM/TileLang and lazily loads the real libnvrtc at runtime.

    • Tries multiple SONAMEs: libnvrtc.so.13, .12, .11.2, .11.1, .11.0, .11, then libnvrtc.so.
    • Provides stable fallback error strings/return codes when NVRTC is unavailable.
  • Update CMakeLists.txt to build cudart_stub and nvrtc_stub when USE_CUDA is enabled, and force TVM’s cached CUDA_CUDART_LIBRARY / CUDA_NVRTC_LIBRARY variables to
    point to these stub targets before add_subdirectory(tvm).

    • This prevents TVM from caching and linking against a specific libcudart.so. / libnvrtc.so. and avoids version-locked DT_NEEDED entries.
  • Add cudart_stub / nvrtc_stub to TILELANG_OUTPUT_TARGETS so existing install/RPATH/patchelf handling is applied consistently.

CUDA 11/12 API compatibility

  • cudart_stub handles the cudaGraphInstantiate signature change between CUDA 11 (legacy 5-arg) and CUDA 12+ (3-arg + flags) safely:
    • Runtime dispatch prefers cudaGraphInstantiateWithFlags when present, otherwise falls back to legacy cudaGraphInstantiate.
    • Exported wrapper signature is selected with #if CUDART_VERSION >= 12000 to match the build-time headers.

Testing

  • Built stub targets with cmake -S . -B build and cmake --build build --target cudart_stub nvrtc_stub.
  • Verified via ldd that libcudart_stub.so / libnvrtc_stub.so do not hard-link to libcudart.so.* / libnvrtc.so.* (resolved dynamically at runtime).

Summary by CodeRabbit

  • New Features

    • Adds dynamic, runtime-loading stubs for CUDA runtime and NVRTC with graceful fallbacks when system libraries are absent; improves compatibility across CUDA versions and allows operation without a fixed SONAME.
    • Adds a build-time option to enable/disable CUDA stub usage and a Windows guard that prevents POSIX-only stubs on unsupported platforms.
  • Chores

    • Updated build and packaging to expose stub artifacts, adjust install-time rpath handling, and refine post-install library cleanup.

Build and ship libcudart_stub.so and libnvrtc_stub.so, then force TVM to link against them so the wheel does not hard-depend on libcudart.so.<major> / libnvrtc.so.<major>. This allows a single wheel to run across CUDA major versions where only libcudart/libnvrtc 12 or 13 is present.
@github-actions
Copy link

github-actions bot commented Feb 9, 2026

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds optional CUDA stub shared libraries (cuda_stub, cudart_stub, nvrtc_stub) and POSIX runtime-loading implementations for CUDA Runtime and NVRTC; updates CMake wiring, output targets, and install-time patchelf removal to account for the new stubs and conditional compilation flags.

Changes

Cohort / File(s) Summary
Build Configuration
CMakeLists.txt
Add TILELANG_USE_CUDA_STUBS option; create cuda_stub, cudart_stub, nvrtc_stub shared targets gated by that option; set CUDA_CUDART_LIBRARY=cudart_stub and CUDA_NVRTC_LIBRARY=nvrtc_stub (CACHE FORCE); include stubs in TILELANG_OUTPUT_TARGETS when USE_CUDA; update patchelf removal to target libcuda.so.1 and libcuda.so; error on Windows when stubs requested.
CUDA Runtime Stub (POSIX)
src/target/stubs/cudart.cc
New shared-library stub implementing many CUDA Runtime C entrypoints. Lazily dlopen()s libcudart variants (tries multiple SONAMEs), resolves symbols with dlsym(), dispatches via an API struct, and returns safe fallback errors when the real library is unavailable.
NVRTC Stub (POSIX)
src/target/stubs/nvrtc.cc
New shared-library stub implementing NVRTC C API surface. Lazily loads libnvrtc variants, resolves required symbols into an API struct, delegates calls when available, and returns standardized fallback errors otherwise.
Windows guard in CUDA glue
src/target/stubs/cuda.cc
Add compile-time guard to error out on native Windows builds when TILELANG_USE_CUDA_STUBS would be used (POSIX-only stubs).

Sequence Diagram(s)

sequenceDiagram
    actor App as Application
    participant Stub as "CUDA / NVRTC Stub"
    participant Loader as "dlopen / dlsym"
    participant Lib as "System CUDA Library\n(libcudart / libnvrtc)"

    App->>Stub: call CUDA/NVRTC API (e.g., cudaMalloc / nvrtcCompileProgram)
    alt first call (no handle)
        Stub->>Loader: dlopen("libX.so.13") / dlopen("libX.so.12") / ...
        Loader-->>Stub: handle or NULL
        alt handle obtained
            Stub->>Loader: dlsym(handle, "symbol")
            Loader-->>Stub: function pointer
            Stub->>Stub: store pointer in API struct
        end
    end
    alt library loaded
        Stub->>Lib: invoke resolved symbol(...)
        Lib-->>Stub: result
        Stub-->>App: return result
    else library unavailable
        Stub-->>App: return fallback error/result
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

enhancement, dependencies

Suggested reviewers

  • oraluben
  • XuehaiPan

Poem

🐰 I hop where symbols hide and play,

dlopen lights the runtime way.
If libcudart or nvrtc roam,
my stubby paws will bring them home.
🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 1.56% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: adding lazy-loading stubs for CUDART and NVRTC to enable CUDA 11/12/13 compatible wheels, matching the core purpose of this PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
CMakeLists.txt (1)

469-490: Consider also removing libcudart.so.* / libnvrtc.so.* via patchelf as a safety net.

Currently, patchelf removes only libcuda.so.1 and libcuda.so. The cudart/nvrtc stubs should prevent DT_NEEDED entries for the real libcudart/libnvrtc, but if TVM's build has any secondary path that sneaks in a direct dependency (similar to how libcuda.so needed removal), those would go unnoticed.

Since patchelf --remove-needed is a no-op when the entry doesn't exist, adding libcudart.so and libnvrtc.so variants would be a low-cost safety net.

Proposed defensive patchelf additions
         COMMAND ${PATCHELF_EXECUTABLE}
           --remove-needed libcuda.so.1
           --remove-needed libcuda.so
+          --remove-needed libcudart.so
+          --remove-needed libnvrtc.so
           \"$<TARGET_FILE:${target}>\"
src/target/stubs/nvrtc.cc (1)

100-126: Silent failure when a required symbol lookup fails — error messages will be misleading.

If dlopen succeeds but a dlsym for a required symbol (e.g., nvrtcCompileProgram) fails, CreateNVRTCAPI returns a zeroed NVRTCAPI{} with no diagnostic. Callers then see "NVRTC stub: libnvrtc not found" from FallbackNvrtcErrorString, which is incorrect — the library was found but is missing symbols (possibly due to a version mismatch).

Consider logging which symbol failed before returning the empty struct. Even a one-line fprintf(stderr, ...) would save significant debugging time for users hitting version-mismatch issues.

Suggested improvement
 `#define` LOOKUP_REQUIRED(name)                                                  \
   api.name##_ = GetSymbol<decltype(api.name##_)>(handle, `#name`);               \
   if (api.name##_ == nullptr) {                                                \
+    fprintf(stderr, "nvrtc_stub: failed to resolve required symbol '%s' "      \
+                    "from loaded libnvrtc; disabling NVRTC API.\n", `#name`);     \
     return NVRTCAPI{};                                                         \
   }

This requires adding #include <cstdio> (or <stdio.h>).

src/target/stubs/cudart.cc (2)

186-241: Same silent-failure pattern as nvrtc.cc — consider adding diagnostic logging.

When GetSymbol fails for any required symbol, the macro at line 188-190 returns a zeroed CUDARuntimeAPI{} without indicating which symbol was missing. The fallback error strings then say "libcudart not found" even though the library loaded successfully — the actual problem is a missing symbol (likely due to a version mismatch).

Adding a fprintf(stderr, ...) in the LOOKUP_REQUIRED failure path (as suggested for nvrtc.cc) would make this much easier to debug.

Suggested improvement
 `#define` LOOKUP_REQUIRED(name)                                                  \
   api.name##_ = GetSymbol<decltype(api.name##_)>(handle, `#name`);               \
   if (api.name##_ == nullptr) {                                                \
+    fprintf(stderr, "cudart_stub: failed to resolve required symbol '%s' "     \
+                    "from loaded libcudart; disabling CUDA Runtime API.\n",     \
+                    `#name`);                                                    \
     return CUDARuntimeAPI{};                                                   \
   }

Requires adding #include <cstdio>.


96-104: GetSymbol helper is duplicated across all three stub files.

The GetSymbol, TryLoad*, and GetLib*Handle patterns are nearly identical in cuda.cc, cudart.cc, and nvrtc.cc. Consider extracting the common template into a shared internal header (e.g., src/target/stubs/dlopen_util.h) to reduce duplication and ensure consistent behavior if the loading logic evolves.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/target/stubs/cudart.cc`:
- Around line 528-539: Add a compile-time guard to prevent building this
CUDA-12-only wrapper against CUDA 11 headers by adding a static_assert that
checks CUDA_MAJOR_VERSION >= 12 with a clear error message; place this near the
top of the cudart stub file (or immediately before the cudaGraphInstantiate
wrapper) so the assertion triggers at compile time if the build headers are CUDA
11.x, and ensure it references the same CUDA header macros used elsewhere so
wrappers like TILELANG_CUDART_STUB_API cudaGraphInstantiate, GetCUDARuntimeAPI,
and MissingLibraryError are only compiled when CUDA_MAJOR_VERSION >= 12.

In `@src/target/stubs/nvrtc.cc`:
- Around line 80-90: The NVRTC API stub is missing CUBIN retrieval functions;
add members nvrtcGetCUBINSize_ and nvrtcGetCUBIN_ to the NVRTCAPI struct
(matching the decltype pattern used for nvrtcGetPTXSize_ and nvrtcGetPTX_) and
initialize them to nullptr so the stub exposes nvrtcGetCUBINSize and
nvrtcGetCUBIN symbols for CUBIN compilation compatibility with
tilelang/contrib/nvrtc.py.
🧹 Nitpick comments (2)
src/target/stubs/cudart.cc (1)

149-212: LOOKUP_REQUIRED blanks the entire API struct on any single symbol failure — including cudaGetErrorString_.

If GetLibCudartHandle() succeeds (i.e., the .so was loaded) but any single required symbol is missing, CreateCUDARuntimeAPI returns a zeroed CUDARuntimeAPI{}, discarding the previously resolved cudaGetErrorString_ pointer (line 164-165). The FallbackCudaErrorString fallback covers this, so it won't crash, but the error message will say "libcudart not found" when in reality libcudart was found but is missing symbols — potentially confusing during debugging.

Consider adjusting the fallback message or preserving cudaGetErrorString_ even when other lookups fail.

src/target/stubs/nvrtc.cc (1)

156-162: Inconsistent defensive zeroing of output parameters vs. cudart.cc.

In cudart.cc, wrappers like cudaGetDevice, cudaGetDeviceCount, cudaMalloc, etc., zero their output pointers in the missing-library path before returning the error. Here, nvrtcVersion doesn't zero *major/*minor, nvrtcCreateProgram doesn't null *prog, nvrtcGetPTXSize doesn't zero *ptxSizeRet, and nvrtcGetProgramLogSize doesn't zero *logSizeRet.

For consistency and defensive robustness:

Example fix for nvrtcVersion
 TILELANG_NVRTC_STUB_API nvrtcResult nvrtcVersion(int *major, int *minor) {
   auto *api = GetNVRTCAPI();
   if (api->nvrtcVersion_ == nullptr) {
+    if (major != nullptr) *major = 0;
+    if (minor != nullptr) *minor = 0;
     return MissingLibraryError();
   }
   return api->nvrtcVersion_(major, minor);
 }

Apply similarly to nvrtcCreateProgram (*prog), nvrtcGetPTXSize (*ptxSizeRet), and nvrtcGetProgramLogSize (*logSizeRet).

Also applies to: 164-174, 194-201, 211-218

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/target/stubs/cudart.cc`:
- Around line 25-33: The file defines Windows visibility macros
(TILELANG_CUDART_STUB_API) but unconditionally includes <dlfcn.h> and uses
dlopen/dlsym/dlerror, which prevents compilation on Windows; either remove the
Windows branch or add a Windows dynamic-load path: wrap the `#include` <dlfcn.h>
and all uses of dlopen/dlsym/dlerror in an `#ifndef` _WIN32 guard and implement
the corresponding Windows equivalents
(LoadLibrary/GetProcAddress/FormatMessage/GetLastError) in the _WIN32 branch, or
if Windows is not supported, delete the _WIN32/#else visibility block so only
the POSIX visibility macro and dlopen-based logic remain; update any references
in the file that call dlopen/dlsym/dlerror accordingly and keep
TILELANG_CUDART_STUB_API unchanged for exported symbols.
🧹 Nitpick comments (3)
src/target/stubs/nvrtc.cc (1)

23-35: Windows export macros defined but implementation is POSIX-only.

dlfcn.h, dlopen, dlsym, and dlerror are used unconditionally, so this file won't compile on Windows despite the __declspec(dllexport/dllimport) branch. If Windows support isn't planned, consider simplifying the export macro to the GCC visibility attribute only, or add a compile guard / #error for unsupported platforms to make the intent explicit.

src/target/stubs/cudart.cc (2)

156-209: All-or-nothing LOOKUP_REQUIRED nullifies the entire API if any single symbol is missing.

If dlopen resolves to an older or stripped libcudart that lacks even one "required" symbol (e.g., cudaGraphInstantiate, cudaIpcGetMemHandle), CreateCUDARuntimeAPI returns a default CUDARuntimeAPI{} with all pointers null — including basic operations like cudaMalloc and cudaMemcpy.

Consider splitting symbols into tiers: core (memory, streams, events) vs. optional (graph capture, IPC, cudaFuncSetAttribute). Core symbols would remain fatal; others would gracefully degrade to MissingLibraryError individually, similar to how cudaFuncSetAttribute_ is already handled on line 206.


132-135: No diagnostic when TryLoadLibCudart fails.

When all dlopen attempts fail, GetLibCudartHandle silently returns nullptr and every subsequent API call returns cudaErrorUnknown. A one-time fprintf(stderr, ...) or similar diagnostic on load failure would significantly ease debugging in environments where libcudart is unexpectedly absent.

Add compile-time checks for CUDA 12+ runtime headers to avoid signature mismatches (e.g. cudaGraphInstantiate) when the stub is built with older toolkits.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/target/stubs/cudart.cc`:
- Around line 552-562: The three-arg cudaGraphInstantiate wrapper risks calling
a CUDA-11 five-arg symbol from an unversioned libcudart.so; update the runtime
loader (GetCUDARuntimeAPI or wherever graph API pointers are populated) to
resolve cudaRuntimeGetVersion and require version >= 12000 before assigning the
CUDA-12 three-arg function pointers (e.g., api->cudaGraphInstantiate_); if the
version check fails, leave those pointers null so cudaGraphInstantiate returns
MissingLibraryError, or alternatively drop resolving the unversioned
libcudart.so fallback entirely—implement one of these two fixes and ensure
cudaGraphInstantiate uses the null-check path already present.
🧹 Nitpick comments (1)
src/target/stubs/cudart.cc (1)

167-230: All-or-nothing symbol resolution makes the stub fragile for optional/uncommon APIs.

If any LOOKUP_REQUIRED symbol fails to resolve (e.g., cudaIpcGetMemHandle or cudaGraphInstantiate on a minimal/stripped CUDA runtime), CreateCUDARuntimeAPI returns a fully zeroed struct — causing every wrapper (including fundamental ones like cudaMalloc, cudaMemcpy) to fail with cudaErrorUnknown.

Consider splitting into tiers: resolve core APIs as required, and treat graph/IPC/capture APIs as optional (like cudaFuncSetAttribute already is). Each wrapper already null-checks its own pointer, so optional symbols degrade gracefully per-function.

♻️ Suggested approach
-  LOOKUP_REQUIRED(cudaStreamBeginCapture)
-  LOOKUP_REQUIRED(cudaStreamEndCapture)
-  LOOKUP_REQUIRED(cudaGraphInstantiate)
-  LOOKUP_REQUIRED(cudaGraphLaunch)
-  LOOKUP_REQUIRED(cudaGraphDestroy)
-  LOOKUP_REQUIRED(cudaGraphExecDestroy)
-  LOOKUP_REQUIRED(cudaIpcGetMemHandle)
-  LOOKUP_REQUIRED(cudaIpcOpenMemHandle)
-  LOOKUP_REQUIRED(cudaIpcCloseMemHandle)
+  // Optional: CUDA Graph APIs (may be absent in minimal runtimes)
+  api.cudaStreamBeginCapture_ = GetSymbol<decltype(api.cudaStreamBeginCapture_)>(handle, "cudaStreamBeginCapture");
+  api.cudaStreamEndCapture_ = GetSymbol<decltype(api.cudaStreamEndCapture_)>(handle, "cudaStreamEndCapture");
+  api.cudaGraphInstantiate_ = GetSymbol<decltype(api.cudaGraphInstantiate_)>(handle, "cudaGraphInstantiate");
+  api.cudaGraphLaunch_ = GetSymbol<decltype(api.cudaGraphLaunch_)>(handle, "cudaGraphLaunch");
+  api.cudaGraphDestroy_ = GetSymbol<decltype(api.cudaGraphDestroy_)>(handle, "cudaGraphDestroy");
+  api.cudaGraphExecDestroy_ = GetSymbol<decltype(api.cudaGraphExecDestroy_)>(handle, "cudaGraphExecDestroy");
+
+  // Optional: IPC APIs
+  api.cudaIpcGetMemHandle_ = GetSymbol<decltype(api.cudaIpcGetMemHandle_)>(handle, "cudaIpcGetMemHandle");
+  api.cudaIpcOpenMemHandle_ = GetSymbol<decltype(api.cudaIpcOpenMemHandle_)>(handle, "cudaIpcOpenMemHandle");
+  api.cudaIpcCloseMemHandle_ = GetSymbol<decltype(api.cudaIpcCloseMemHandle_)>(handle, "cudaIpcCloseMemHandle");

- Adjusted comments in CMakeLists.txt to reflect the correct major versions for libcudart and NVRTC.
- Modified cudart.cc to support CUDA 11.x, including changes to the function pointer typedefs and the GraphInstantiate function to handle both legacy and new signatures.
- Updated nvrtc.cc to include support for NVRTC versions 11.0 to 13.x, ensuring compatibility across different CUDA environments.
@LeiWang1999 LeiWang1999 changed the title Add lazy-loading stubs for CUDA Runtime and NVRTC [Dist] Add lazy-loading stubs for CUDART + NVRTC (CUDA 11/12/13 compatible wheels) Feb 9, 2026
- Introduced an option to use POSIX dlopen-based CUDA stub libraries for better compatibility across different CUDA Toolkit versions and CPU-only machines.
- Updated CMakeLists.txt to conditionally enable CUDA stubs based on the platform.
- Added compile-time checks in CUDA, CUDART, and NVRTC stub implementations to ensure they are only built on POSIX systems, providing clear error messages for Windows users.
- Enhanced documentation within the CMake configuration for clarity on the use of CUDA stubs.
@LeiWang1999 LeiWang1999 merged commit c65dfae into tile-ai:main Feb 9, 2026
12 of 13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant