
Conversation

@brandon-b-miller
Contributor

This PR fixes an NRT caching bug encountered while running groups of tests for a separate task.

NRT currently must be toggleable because, without the incref/decref pruning pass from upstream Numba, the test suite wall time is intractable. When NRT is enabled, Numba may inject incref/decref calls into generated code. Later, Numba inspects the code to be linked for these calls to determine whether the other half of the NRT library needs to be linked. However, the current logic answers "no" whenever NRT is not currently enabled.
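The flawed decision can be sketched as follows. This is a hedged illustration: `needs_nrt_link_buggy` and the plain string scan are simplifications, not the actual numba-cuda code.

```python
# Minimal sketch (hypothetical name) of the flawed link decision described
# above: the code to be linked is scanned for NRT_incref/NRT_decref
# references, but the check short-circuits to "no" whenever NRT is currently
# disabled, even when the (cached) code was generated while NRT was on.

def needs_nrt_link_buggy(ptx_source: str, nrt_enabled: bool) -> bool:
    if not nrt_enabled:
        # The short circuit: only the *current* setting is consulted.
        return False
    return "NRT_incref" in ptx_source or "NRT_decref" in ptx_source


# Cached PTX generated while NRT was enabled still references NRT symbols,
# yet with the setting now off the check answers "no", NRT is not linked,
# and ptxas later fails with an unresolved extern.
cached_ptx = ".extern .func NRT_incref ..."
```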

This can cause linker errors down the line if we get a cache hit after the setting has been toggled off. The failure can be observed via:

```python
import numpy as np
import cupy as cp
from numba import cuda, config


def array_reshape1d(arr, newshape, got):
    y = arr.reshape(newshape)
    for i in range(y.shape[0]):
        got[i] = y[i]


def array_reshape(arr, newshape):
    return arr.reshape(newshape)


# Compile a kernel that uses NRT while it is enabled...
config.CUDA_ENABLE_NRT = 1

@cuda.jit
def kernel(out):
    out[0] = np.min(np.array([1.0, 2.0, 0.0, -0.0, 1.0, -1.5]))

out = cp.zeros(1, dtype=cp.float64)
kernel[1, 1](out)

# ...then disable NRT and compile a second kernel that hits the cache.
config.CUDA_ENABLE_NRT = 0

arr = np.arange(24)
kernel2 = cuda.jit(array_reshape1d)
expected = array_reshape(arr, (24,))
got = np.zeros(expected.shape, dtype=arr.dtype)
kernel2[1, 1](arr, (24,), got)
```

which fails with:

```
cuda.bindings.nvjitlink.nvJitLinkError: ERROR_PTX_COMPILE (4)
Linker error log: ptxas fatal   : Unresolved extern function 'NRT_incref'
ERROR NVJITLINK_ERROR_PTX_COMPILE: JIT the PTX (ltoPtx)
```

This PR fixes the issue by remembering whether NRT was ever turned on during the Python session and avoiding the short circuit that declines to link NRT when it is not currently enabled.
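The approach can be sketched as follows, using illustrative names rather than the actual numba-cuda internals (`_nrt_ever_enabled`, `enable_nrt`, and `needs_nrt_link_fixed` are hypothetical):

```python
# Session-level memory: was NRT ever enabled in this Python session?
# (Name and module-level placement are illustrative only.)
_nrt_ever_enabled = False

def enable_nrt():
    global _nrt_ever_enabled
    _nrt_ever_enabled = True

def needs_nrt_link_fixed(ptx_source: str, nrt_enabled: bool) -> bool:
    # Skip the NRT-usage scan only if NRT is off now *and* was never on
    # before, so cached NRT-using code still gets the runtime linked.
    if not (nrt_enabled or _nrt_ever_enabled):
        return False
    return "NRT_incref" in ptx_source or "NRT_decref" in ptx_source
```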

@copy-pr-bot

copy-pr-bot bot commented Dec 10, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@brandon-b-miller
Contributor Author

/ok to test

@greptile-apps
Contributor

greptile-apps bot commented Dec 10, 2025

Greptile Overview

Greptile Summary

This PR fixes an NRT (Numba Runtime) caching bug that caused linker errors when NRT was toggled between kernel compilations.

  • Root cause: The compile_subroutine method's cache key did not include NRT state, allowing cached NRT-enabled code to be reused when NRT was later disabled
  • Fix: Adds self.enable_nrt to the cache key in base.py, ensuring subroutines compiled with different NRT states are cached separately
  • Test: Adds regression test that compiles a kernel with NRT enabled, then compiles another kernel with NRT disabled that uses the same reshape primitive, verifying no linker errors occur
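The cache-key change summarized above can be illustrated with a minimal sketch. `SubroutineCache` and `compile_subroutine`'s shape here are hypothetical stand-ins, not the actual base.py code:

```python
# Illustrative sketch: including the NRT flag in the subroutine cache key
# means code compiled under different NRT settings never collides.

class SubroutineCache:
    def __init__(self):
        self._cache = {}
        self.enable_nrt = False  # mirrors the context's NRT setting

    def compile_subroutine(self, name, sig, compile_fn):
        # NRT state is now part of the key, so toggling NRT forces a
        # fresh compilation instead of reusing a stale cached entry.
        key = (name, sig, self.enable_nrt)
        if key not in self._cache:
            self._cache[key] = compile_fn()
        return self._cache[key]
```

With this key, a kernel compiled while NRT is enabled and one compiled after it is disabled each get their own cache entry.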

Confidence Score: 5/5

  • This PR is safe to merge - it's a targeted bug fix with a regression test that adds NRT state to the cache key
  • The fix is minimal, well-targeted, and addresses the root cause correctly. Adding NRT state to the cache key is the appropriate solution to prevent false cache hits. The regression test validates the fix by reproducing the exact scenario described in the PR.
  • No files require special attention

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| numba_cuda/numba/cuda/core/base.py | 5/5 | Adds `self.enable_nrt` to the subroutine cache key to prevent cache hits when NRT state differs, fixing a bug where NRT-enabled code could be reused when NRT is disabled. |
| numba_cuda/numba/cuda/tests/nrt/test_nrt.py | 5/5 | Adds a regression test for the NRT toggle caching bug, verifying that kernels work correctly when NRT is enabled for one kernel and then disabled for another that reuses cached primitives. |

Contributor

@greptile-apps greptile-apps bot left a comment


Additional Comments (1)

  1. numba_cuda/numba/cuda/dispatcher.py, line 231-235 (link)

    style: Docstring is now outdated. It says "NRT must be enabled for the CUDATargetContext" but the code now also proceeds when _LINKED_NRT is True (i.e., when NRT was previously linked in this session).

1 file reviewed, 1 comment


@gmarkall
Contributor

Can the reproducer in the description be turned into a test case?

@gmarkall gmarkall added the 4 - Waiting on author Waiting for author to respond to review label Dec 11, 2025
@brandon-b-miller
Contributor Author

/ok to test

@brandon-b-miller brandon-b-miller added 3 - Ready for Review Ready for review by team and removed 4 - Waiting on author Waiting for author to respond to review labels Dec 11, 2025
Contributor

@greptile-apps greptile-apps bot left a comment


2 files reviewed, no comments


@brandon-b-miller
Contributor Author

/ok to test

Contributor

@greptile-apps greptile-apps bot left a comment


Additional Comments (2)

  1. numba_cuda/numba/cuda/tests/nrt/test_nrt.py, line 386 (link)

    logic: _Kernel._LINKED_NRT does not exist in the codebase. This line was added when _LINKED_NRT existed in a previous commit, but that commit was reverted. This will raise AttributeError: type object '_Kernel' has no attribute '_LINKED_NRT' when the test runs. Either remove this line or re-add the _LINKED_NRT attribute to the _Kernel class.

  2. numba_cuda/numba/cuda/tests/nrt/test_nrt.py, line 396-406 (link)

    style: Config changes not restored after test. Consider using override_config context manager (used elsewhere in this file) or add cleanup to restore config.CUDA_ENABLE_NRT to its original value to prevent test pollution.
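The reviewer's `override_config` suggestion can be sketched as a context manager that restores the setting even if the test fails. The `_Config` class here is a stand-in for the real config module, and this is a simplified mirror of Numba's helper, not its actual implementation:

```python
from contextlib import contextmanager

class _Config:
    # Stand-in for the config module; the real attribute lives on
    # numba's config object.
    CUDA_ENABLE_NRT = 0

config = _Config()

@contextmanager
def override_config(name, value):
    # Set the attribute for the duration of the block, then restore the
    # original value even if the body raises, preventing test pollution.
    old = getattr(config, name)
    setattr(config, name, value)
    try:
        yield
    finally:
        setattr(config, name, old)
```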

2 files reviewed, 2 comments


@brandon-b-miller
Contributor Author

/ok to test

Contributor

@greptile-apps greptile-apps bot left a comment


2 files reviewed, no comments


@gmarkall gmarkall added 5 - Ready to merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Dec 11, 2025
@brandon-b-miller brandon-b-miller merged commit 30b024f into NVIDIA:main Dec 11, 2025
71 checks passed
@brandon-b-miller brandon-b-miller deleted the fix-nrt-caching branch December 11, 2025 15:10
gmarkall added a commit to gmarkall/numba-cuda that referenced this pull request Dec 17, 2025
- Fix NVIDIA#624: Accept Numba IR nodes in all places Numba-CUDA IR nodes are expected (NVIDIA#643)
- Fix Issue NVIDIA#588: separate compilation of NVVM IR modules when generating debuginfo (NVIDIA#591)
- feat: allow printing nested tuples (NVIDIA#667)
- build(deps): bump actions/setup-python from 5.6.0 to 6.1.0 (NVIDIA#655)
- build(deps): bump actions/upload-artifact from 4 to 5 (NVIDIA#652)
- Test RAPIDS 25.12 (NVIDIA#661)
- Do not manually set DUMP_ASSEMBLY in `nvjitlink` tests (NVIDIA#662)
- feat: add print support for int64 tuples (NVIDIA#663)
- Only run dependabot monthly and open fewer PRs (NVIDIA#658)
- test: fix bogus `self` argument to `Context` (NVIDIA#656)
- Fix false negative NRT link decision when NRT was previously toggled on (NVIDIA#650)
- Add support for dependabot (NVIDIA#647)
- refactor: cull dead linker objects (NVIDIA#649)
- Migrate numba-cuda driver to use cuda.core.launch API (NVIDIA#609)
- feat: add set_shared_memory_carveout (NVIDIA#629)
- chore: bump version in pixi.toml (NVIDIA#641)
- refactor: remove devicearray code to reduce complexity (NVIDIA#600)
@gmarkall gmarkall mentioned this pull request Dec 17, 2025
gmarkall added a commit that referenced this pull request Dec 17, 2025
- Capture global device arrays in kernels and device functions (#666)
- Fix #624: Accept Numba IR nodes in all places Numba-CUDA IR nodes are expected (#643)
- Fix Issue #588: separate compilation of NVVM IR modules when generating debuginfo (#591)
- feat: allow printing nested tuples (#667)
- build(deps): bump actions/setup-python from 5.6.0 to 6.1.0 (#655)
- build(deps): bump actions/upload-artifact from 4 to 5 (#652)
- Test RAPIDS 25.12 (#661)
- Do not manually set DUMP_ASSEMBLY in `nvjitlink` tests (#662)
- feat: add print support for int64 tuples (#663)
- Only run dependabot monthly and open fewer PRs (#658)
- test: fix bogus `self` argument to `Context` (#656)
- Fix false negative NRT link decision when NRT was previously toggled on (#650)
- Add support for dependabot (#647)
- refactor: cull dead linker objects (#649)
- Migrate numba-cuda driver to use cuda.core.launch API (#609)
- feat: add set_shared_memory_carveout (#629)
- chore: bump version in pixi.toml (#641)
- refactor: remove devicearray code to reduce complexity (#600)
ZzEeKkAa added a commit to ZzEeKkAa/numba-cuda that referenced this pull request Jan 8, 2026
v0.23.0

- Capture global device arrays in kernels and device functions (NVIDIA#666)
- Fix NVIDIA#624: Accept Numba IR nodes in all places Numba-CUDA IR nodes are expected (NVIDIA#643)
- Fix Issue NVIDIA#588: separate compilation of NVVM IR modules when generating debuginfo (NVIDIA#591)
- feat: allow printing nested tuples (NVIDIA#667)
- build(deps): bump actions/setup-python from 5.6.0 to 6.1.0 (NVIDIA#655)
- build(deps): bump actions/upload-artifact from 4 to 5 (NVIDIA#652)
- Test RAPIDS 25.12 (NVIDIA#661)
- Do not manually set DUMP_ASSEMBLY in `nvjitlink` tests (NVIDIA#662)
- feat: add print support for int64 tuples (NVIDIA#663)
- Only run dependabot monthly and open fewer PRs (NVIDIA#658)
- test: fix bogus `self` argument to `Context` (NVIDIA#656)
- Fix false negative NRT link decision when NRT was previously toggled on (NVIDIA#650)
- Add support for dependabot (NVIDIA#647)
- refactor: cull dead linker objects (NVIDIA#649)
- Migrate numba-cuda driver to use cuda.core.launch API (NVIDIA#609)
- feat: add set_shared_memory_carveout (NVIDIA#629)
- chore: bump version in pixi.toml (NVIDIA#641)
- refactor: remove devicearray code to reduce complexity (NVIDIA#600)
