
Add arch specific target support#549

Merged
gmarkall merged 17 commits into NVIDIA:main from ZzEeKkAa:yhavrylko/feature/arch_specific_target
Jan 12, 2026

Conversation

@ZzEeKkAa
Contributor

WIP! Supposed to work; some tests still need to be added.

Closes: #469

@ZzEeKkAa ZzEeKkAa requested a review from gmarkall October 24, 2025 19:11
@ZzEeKkAa ZzEeKkAa self-assigned this Oct 24, 2025
@copy-pr-bot

copy-pr-bot bot commented Oct 24, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@gmarkall gmarkall added the "2 - In Progress" (currently a work in progress) label Oct 27, 2025
@rparolin rparolin added the "P0" (blocking external stakeholders) label Nov 20, 2025
v0.23.0

- Capture global device arrays in kernels and device functions (NVIDIA#666)
- Fix NVIDIA#624: Accept Numba IR nodes in all places Numba-CUDA IR nodes are expected (NVIDIA#643)
- Fix Issue NVIDIA#588: separate compilation of NVVM IR modules when generating debuginfo (NVIDIA#591)
- feat: allow printing nested tuples (NVIDIA#667)
- build(deps): bump actions/setup-python from 5.6.0 to 6.1.0 (NVIDIA#655)
- build(deps): bump actions/upload-artifact from 4 to 5 (NVIDIA#652)
- Test RAPIDS 25.12 (NVIDIA#661)
- Do not manually set DUMP_ASSEMBLY in `nvjitlink` tests (NVIDIA#662)
- feat: add print support for int64 tuples (NVIDIA#663)
- Only run dependabot monthly and open fewer PRs (NVIDIA#658)
- test: fix bogus `self` argument to `Context` (NVIDIA#656)
- Fix false negative NRT link decision when NRT was previously toggled on (NVIDIA#650)
- Add support for dependabot (NVIDIA#647)
- refactor: cull dead linker objects (NVIDIA#649)
- Migrate numba-cuda driver to use cuda.core.launch API (NVIDIA#609)
- feat: add set_shared_memory_carveout (NVIDIA#629)
- chore: bump version in pixi.toml (NVIDIA#641)
- refactor: remove devicearray code to reduce complexity (NVIDIA#600)
@ZzEeKkAa ZzEeKkAa marked this pull request as ready for review January 8, 2026 16:57
@copy-pr-bot

copy-pr-bot bot commented Jan 8, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@ZzEeKkAa
Contributor Author

ZzEeKkAa commented Jan 8, 2026

/ok to test

@gmarkall
Contributor

gmarkall commented Jan 8, 2026

/ok to test

@greptile-apps (bot) left a comment

Greptile Overview

Greptile Summary

This PR adds support for architecture-specific CUDA compute capability targets (e.g., sm_90a) to enable Hopper-specific features like Tensor Memory Accelerator (TMA) instructions. The changes extend the compute capability tuple from 2 elements (major, minor) to an optional 3 elements (major, minor, arch_suffix), where the suffix can be "a" (accelerated) or "f" (full).

Key changes:

  • Extended _ensure_cc in codegen.py to automatically add "a" suffix for CC >= 9.0
  • Added _verify_cc_tuple validation function in nvrtc.py to handle and validate arch suffixes
  • Updated get_arch_option to accept and propagate the arch suffix parameter
  • Modified _Linker to support 3-element CC tuples for proper target generation
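
For orientation, a small sketch of the extended tuple format these changes introduce (values are illustrative):

cc_plain = (8, 6)         # classic 2-tuple         -> sm_86 / compute_86
cc_accel = (9, 0, "a")    # "accelerated" variant   -> sm_90a (Hopper features such as TMA)
cc_full = (10, 0, "f")    # "full" (family) variant -> sm_100f

for cc in (cc_plain, cc_accel, cc_full):
    suffix = cc[2] if len(cc) == 3 else ""
    print(f"sm_{cc[0]}{cc[1]}{suffix}")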

Issues found:

  • Critical: The test function signature in test_cuda_driver.py has a parameter mismatch - the C function declares 2 parameters but declare_device only specifies 1, which will cause runtime errors
  • Minor: Outdated docstring parameter name in find_closest_arch

Confidence Score: 2/5

  • This PR has a critical test bug that will cause runtime failures
  • The core implementation for arch-specific targets appears sound, but the test has a critical signature mismatch that will cause failures. The C function expects 2 parameters but is declared with only 1, making the test non-functional.
  • Pay close attention to numba_cuda/numba/cuda/tests/cudadrv/test_cuda_driver.py - the function signature mismatch must be fixed before merging

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| numba_cuda/numba/cuda/codegen.py | 4/5 | Modified `_ensure_cc` to add the arch suffix ("a") for CC >= 9.0 and removed the explicit CC parameter from the `get_cubin` call |
| numba_cuda/numba/cuda/cudadrv/driver.py | 5/5 | Added support for 3-element CC tuples in `_Linker.__init__` to handle arch-specific targets |
| numba_cuda/numba/cuda/cudadrv/nvrtc.py | 3/5 | Added `_verify_cc_tuple` validation function and updated `get_arch_option` to support arch suffixes; docstring parameter name is outdated |
| numba_cuda/numba/cuda/testing.py | 5/5 | Added `skip_unless_cc_90` test decorator for CC 9.0+ tests |
| numba_cuda/numba/cuda/tests/cudadrv/test_cuda_driver.py | 2/5 | Added test for arch-specific features; signature mismatch between the C function (2 params) and `declare_device` (1 param) will cause a runtime error |

""")

set_descriptor = cuda.declare_device(
"set_descriptor",
Contributor:

The `types.int32(types.CPointer(types.int32))` signature doesn't match the C function, which takes two parameters (`int *out`, `int *smem`), but `declare_device` is only passed one parameter type.

Contributor:

This is not an error: the Numba calling convention specifies that the return value is stored into the location pointed to by the first argument.
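
A minimal sketch of that convention, assuming the declaration from the test under review (the C signature is shown as a comment):

from numba import cuda, types

# The ONE-parameter declaration on the Python side...
set_descriptor = cuda.declare_device(
    "set_descriptor", types.int32(types.CPointer(types.int32))
)

# ...binds to a C symbol with TWO parameters: Numba prepends a pointer
# through which the Python-level return value is written, and the C
# function's own int return value serves as a status code:
#
#     extern "C" __device__ int set_descriptor(int *out, int *smem);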

@greptile-apps
Contributor

greptile-apps bot commented Jan 8, 2026

Additional Comments (1)

numba_cuda/numba/cuda/cudadrv/nvrtc.py
Docstring parameter name is outdated

    :param cc: Compute capability as a tuple ``(MAJOR, MINOR)``

@gmarkall
Contributor

gmarkall commented Jan 8, 2026

/ok to test

@greptile-apps (bot) left a comment

Greptile Overview

Greptile Summary

Adds architecture-specific target support (sm_90a, sm_90f) by extending compute capability tuples from a 2-element to an optional 3-element format. Modifies the compilation pipeline to automatically append the "a" suffix for CC 9.0+ devices and validates architecture suffixes in NVRTC compilation. Updates the linker to handle the new tuple format when generating architecture strings.

Confidence Score: 1/5

  • Contains critical backward compatibility issues and runtime errors
  • Two critical issues block merge: (1) codegen.py modifies device.compute_capability from 2-tuple to 3-tuple, breaking comparisons throughout the codebase where 2-tuples are expected (testing.py:282, driver.py:591, driver.py:622, and multiple test files); (2) nvrtc.py:202 has tuple concatenation bug causing immediate runtime crash. Additionally, there are concerns about cache key consistency and arch suffix handling with FORCE_CUDA_CC.
  • numba_cuda/numba/cuda/codegen.py and numba_cuda/numba/cuda/cudadrv/nvrtc.py require immediate fixes before merge

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| numba_cuda/numba/cuda/codegen.py | 1/5 | Modifies compute_capability from a 2-tuple to a 3-tuple; breaks existing comparisons throughout the codebase |
| numba_cuda/numba/cuda/cudadrv/nvrtc.py | 1/5 | Contains a tuple concatenation bug that will cause a runtime error; arch suffix validation added correctly |
| numba_cuda/numba/cuda/cudadrv/driver.py | 3/5 | Updates the linker to handle an optional 3rd arch element in the cc tuple; implementation looks correct |

Comment on lines +212 to +214
cc = device.compute_capability
cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
return cc
Contributor:

This change breaks backward compatibility by modifying compute_capability from a 2-tuple to a 3-tuple. Throughout the codebase, device.compute_capability is expected to be a 2-tuple (major, minor) and is compared with 2-tuples:

  • testing.py:282: cc >= (major, minor) where cc = device.compute_capability
  • driver.py:591: self.compute_capability < MIN_REQUIRED_CC
  • driver.py:622: self.compute_capability >= (5, 3)
  • Multiple test files perform similar comparisons

After line 213 executes, cc becomes a 3-tuple, which makes tuple comparisons subtle. For example, (9, 0, "a") >= (9, 0) evaluates to True in Python, but (9, 0) >= (9, 0, "a") evaluates to False, because a tuple that is a strict prefix of another compares as smaller.

The modification should only happen locally within this method and not affect the device's compute_capability property. Consider storing the arch-specific version separately or ensure all comparison sites are updated to handle 3-tuples.

Contributor:

The device's compute capability is not being affected.
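
For reference, the tuple-ordering rules behind this exchange, as plain Python:

# Equal prefixes: the longer tuple compares greater.
assert (9, 0, "a") >= (9, 0)
assert not (9, 0) >= (9, 0, "a")

# An earlier element decides before the suffix is ever compared, so no
# string-vs-int comparison occurs here.
assert not (8, 0, "") >= (9, 0)

# Strings in the third slot only meet each other between two 3-tuples.
assert (9, 0, "f") > (9, 0, "a")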

Comment on lines 200 to 203
msg = (
    "GPU compute capability %d.%d is not supported"
    "(requires >=%d.%d)" % (mycc + cc)
    "(requires >=%d.%d)" % (cc + supported_cc)
)
Contributor:

Tuple concatenation syntax error will cause runtime crash. The expression cc + supported_cc attempts to concatenate two tuples for use in string formatting with %, but the format string expects 4 separate values, not a 4-tuple.

This should be:

Suggested change
msg = (
    "GPU compute capability %d.%d is not supported"
    "(requires >=%d.%d)" % (mycc + cc)
    "(requires >=%d.%d)" % (cc + supported_cc)
)
msg = (
    "GPU compute capability %d.%d is not supported"
    "(requires >=%d.%d)" % (cc[0], cc[1], supported_cc[0], supported_cc[1])
)

Note: This bug exists in both the original and modified code at this location. The refactoring changed the parameter name from mycc to cc but didn't fix the pre-existing bug.

Contributor:

I think this is correct, but fixing it is low-priority and should be done outside this PR.

Contributor:

Upon re-reading the comment, I think it is bogus.
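
A quick check of why: adjacent string literals are concatenated at parse time, before % is applied, so the concatenated 4-tuple matches the four %d specifiers (illustrative values):

cc = (7, 0)
supported_cc = (9, 0)
msg = (
    "GPU compute capability %d.%d is not supported"
    "(requires >=%d.%d)" % (cc + supported_cc)
)
assert msg == "GPU compute capability 7.0 is not supported(requires >=9.0)"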

if cufunc:
    return cufunc
cubin = self.get_cubin(cc=device.compute_capability)
cubin = self.get_cubin()
Contributor:

The removal of the explicit cc=device.compute_capability parameter may cause issues. The get_cubin() method will call _ensure_cc(None), which will fetch device.compute_capability and convert it to a 3-tuple with the arch suffix.

However, if the device's compute capability is cached elsewhere or accessed later, there's an inconsistency: the device object still reports a 2-tuple, but compilation uses a 3-tuple. Verify that cached cubins are properly keyed by the 3-tuple to avoid cache misses or incorrect binary selection.

Contributor:

The cc argument was removed here because it was the same as the default that got used in _ensure_cc anyway. So this removal saved making the same change in two different places.

Comment on lines 215 to 218
if config.FORCE_CUDA_CC:
    arch = config.FORCE_CUDA_CC
    major, minor = config.FORCE_CUDA_CC
else:
    arch = find_closest_arch((major, minor))
return "compute_%d%d" % arch
major, minor = find_closest_arch((major, minor))
Contributor:

When config.FORCE_CUDA_CC is set, the arch parameter is discarded. If a user explicitly passes a 3-tuple with an arch suffix (e.g., (9, 0, "a")), the arch suffix will be lost when FORCE_CUDA_CC overrides the major/minor values.

Consider preserving the arch suffix:

Suggested change
if config.FORCE_CUDA_CC:
    arch = config.FORCE_CUDA_CC
    major, minor = config.FORCE_CUDA_CC
else:
    arch = find_closest_arch((major, minor))
return "compute_%d%d" % arch
major, minor = find_closest_arch((major, minor))
if config.FORCE_CUDA_CC:
    major, minor = config.FORCE_CUDA_CC
    # arch parameter is preserved from the function argument
else:

This ensures that if someone calls get_arch_option(9, 0, "a") with FORCE_CUDA_CC=(9, 0), they still get "compute_90a" instead of "compute_90". Should the arch suffix be preserved when FORCE_CUDA_CC is set, or is it intentional to reset it?

Contributor:

I think arch should be discarded when FORCE_CUDA_CC is set but it isn't here. This should be fixed, but not in the way suggested by greptile.

Comment on lines +412 to +415
set_descriptor = cuda.declare_device(
    "set_descriptor",
    types.int32(types.CPointer(types.int32)),
    link=[set_desc],
Contributor:

The set_descriptor function has an unused out parameter and doesn't actually set any descriptor—it only executes a tensormap.replace instruction. The function signature and implementation don't match the apparent intent.

If this is intentional for testing that the architecture-specific instruction compiles and runs, consider renaming the function to reflect what it actually does, or remove the unused parameter.

Contributor:

It's intentional for testing.

Contributor:

Also, this is updating a TMA descriptor, so the name seems appropriate.

When `config.FORCE_CUDA_CC` is set, the arch option from it should be
used regardless of what is passed into `get_arch_option()`.

Also, if the discovered CC is not an exact match for the requested one,
it is not valid to use arch-specific versions because the arch-specific
PTX will not be compatible with the device being targeted.
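
A plain-Python restatement of that rule (a sketch with hypothetical names, not the library code):

def resolve_arch(requested, closest, suffix=""):
    # requested/closest are (major, minor) pairs; suffix is "", "a" or "f".
    if closest != requested and suffix:
        # Arch-specific PTX (e.g. compute_90a) only runs on that exact
        # architecture, so a downgraded CC cannot keep the suffix.
        raise ValueError(
            f"Can't use arch-specific compute_{requested[0]}{requested[1]}"
            f"{suffix} with closest found compute capability "
            f"compute_{closest[0]}{closest[1]}"
        )
    return f"compute_{closest[0]}{closest[1]}{suffix}"

assert resolve_arch((9, 0), (9, 0), "a") == "compute_90a"
assert resolve_arch((10, 3), (9, 0)) == "compute_90"  # downgrade without suffix is fine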
@gmarkall
Contributor

/ok to test

@greptile-apps
Contributor

greptile-apps bot commented Jan 12, 2026

Greptile Overview

Greptile Summary

This PR adds support for architecture-specific CUDA compute capability targets (e.g., sm_90a, sm_100f) by extending the compute capability tuple format from 2-element (major, minor) to optional 3-element (major, minor, arch_suffix). The implementation automatically appends arch suffixes like "a" for CC >= 9.0 to enable arch-specific CUDA features such as Tensor Memory Accelerator (TMA) instructions.

Implementation Overview

Core Changes:

  • codegen.py: Modified _ensure_cc() to auto-append "a" suffix for CC >= 9.0
  • nvrtc.py: Added _verify_cc_tuple() for validation; updated compile() and get_arch_option() to handle 3-element tuples
  • driver.py: Updated _Linker.__init__() to construct arch strings from both 2 and 3-element tuples
  • testing.py: Added skip_unless_cc_90 decorator for tests requiring CC 9.0+
  • Tests: Added comprehensive tests for arch-specific compilation including TMA instruction usage

Critical Blocking Issues

1. Cache Key Inconsistency (CRITICAL - Already Noted)

_ensure_cc() returns 3-tuple (major, minor, arch) for CC >= 9.0 when cc=None, but returns user-provided tuples as-is when cc is not None. This creates:

  • Mixed 2-tuple and 3-tuple cache keys for the same logical compute capability
  • Cache lookups will fail to find previously compiled code
  • Unpacking operations using *cc will fail with 2-tuples when 3 elements are expected (see the sketch below)
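
A toy illustration of that unpacking hazard:

major, minor, arch = (9, 0, "a")   # fine for a 3-tuple
try:
    major, minor, arch = (9, 0)    # a 2-tuple from an un-normalized path
except ValueError as exc:
    print(exc)  # not enough values to unpack (expected 3, got 2)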

2. magic_tuple() Mismatch (CRITICAL - Already Noted)

magic_tuple() at line 574 returns a 2-tuple from device.compute_capability, but _ensure_cc() returns a 3-tuple for CC >= 9.0. Since magic_tuple() is used by the caching system (via _index_key() in caching.py), this causes cache keys to be inconsistent:

  • Cache stores with key using 3-tuple from _ensure_cc()
  • Cache lookups use key with 2-tuple from magic_tuple()
  • Result: Every compilation with CC >= 9.0 will miss the cache and recompile

3. NRT Module Inconsistency (NEW CRITICAL ISSUE)

The nrt.py file at line 127 gets cc from get_current_device().compute_capability (2-tuple) and passes it directly to _Linker without applying the arch suffix transformation. For CC >= 9.0 devices:

  • JIT kernels compile with sm_90a (via _ensure_cc())
  • NRT memsys.cu compiles with sm_90 (no suffix)
  • This creates potential linking incompatibilities and means NRT doesn't benefit from arch-specific features

4. Backward Compatibility Risk

Serialized code libraries (pickle) use cc tuples as cache keys. Pre-PR caches have 2-tuple keys, post-PR can have 3-tuple keys. This breaks cache compatibility across versions for CC >= 9.0 devices.

Required Fixes

  1. Normalize ALL cc tuples: _ensure_cc() must normalize both user-provided and device-derived tuples to 3-element format
  2. Fix magic_tuple(): Apply the same transformation to return 3-tuple for CC >= 9.0
  3. Fix nrt.py: Apply arch suffix transformation before passing to _Linker
  4. Consider migration strategy: Document or implement cache invalidation for the tuple format change

The PR's goal is sound and tests are well-designed, but the implementation has critical bugs that will cause compilation cache misses and inconsistent behavior across the codebase.

Confidence Score: 1/5

  • This PR contains critical bugs that will cause cache mismatches, compilation inconsistencies, and potential runtime failures
  • Score of 1 reflects multiple critical logical errors: (1) _ensure_cc() doesn't normalize user-provided tuples causing inconsistent cache keys and unpacking failures, (2) magic_tuple() returns 2-tuple while _ensure_cc() returns 3-tuple for CC >= 9.0 causing cache misses on every compilation, (3) nrt.py won't get arch suffix for CC >= 9.0 devices creating linking incompatibilities. These are blocking issues that break core functionality.
  • Critical attention needed: codegen.py (_ensure_cc and magic_tuple methods), nrt.py (compute capability handling). Also review nvrtc.py for clarity improvements.

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| numba_cuda/numba/cuda/codegen.py | 1/5 | Critical: `_ensure_cc()` doesn't normalize user-provided cc tuples, creating cache inconsistencies; `magic_tuple()` returns a 2-tuple, causing cache mismatches; `get_cubin()` call simplified but needs validation |
| numba_cuda/numba/cuda/cudadrv/nvrtc.py | 3/5 | Added `_verify_cc_tuple()` for validation; variable shadowing in `_verify_cc_tuple`; misleading warning messages; `get_arch_option()` has a parameter overwrite issue with FORCE_CUDA_CC |
| numba_cuda/numba/cuda/cudadrv/driver.py | 4/5 | `_Linker.__init__()` updated to handle both 2- and 3-element cc tuples for arch string formatting; implementation looks correct and handles backward compatibility well |
| numba_cuda/numba/cuda/memory_management/nrt.py | 2/5 | Not changed in PR but affected: uses device.compute_capability (2-tuple) directly without the arch suffix transformation, causing inconsistency with codegen.py for CC >= 9.0 devices |

@greptile-apps (bot) left a comment

3 files reviewed, 3 comments


Comment on lines +413 to +416
set_descriptor = cuda.declare_device(
    "set_descriptor",
    types.int32(types.CPointer(types.int32)),
    link=[set_desc],
Contributor:

[P0] Function signature mismatch causes runtime error. The C function set_descriptor takes 2 parameters (int *out, int* smem), but the declared signature only specifies 1 parameter. Change to types.int32(types.CPointer(types.int32), types.CPointer(types.int32)).

def kernel(a):
    sm = cuda.shared.array(1, dtype=np.int32)
    data_ptr = ffi.from_buffer(sm)
    set_descriptor(data_ptr)
Contributor:

[P0] Function call missing required parameter. set_descriptor expects 2 arguments but only receives 1. Should be set_descriptor(data_ptr, data_ptr) or similar based on the intended semantics.

@greptile-apps
Contributor

greptile-apps bot commented Jan 12, 2026

Additional Comments (1)

numba_cuda/numba/cuda/codegen.py
[P1] Cache key inconsistency. This returns a 2-tuple (major, minor) from device.compute_capability, but _ensure_cc now returns a 3-tuple for cc >= 9.0. This creates inconsistent cache keys. Should apply the same transformation: cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "").
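
A toy demonstration of the mismatch, with a plain dictionary standing in for the real caches:

cubin_cache = {}
store_key = (9, 0, "a")  # key shape produced via _ensure_cc()
lookup_key = (9, 0)      # key shape produced via magic_tuple()

cubin_cache[store_key] = b"<cubin>"
assert cubin_cache.get(lookup_key) is None  # every lookup misses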

@greptile-apps
Contributor

greptile-apps bot commented Jan 12, 2026

Greptile Overview

Greptile Summary

Overview

This PR adds support for architecture-specific CUDA compute capability targets (e.g., sm_90a, sm_100f) to enable use of arch-specific features in CUDA kernels. The implementation extends compute capability tuples from 2-element (major, minor) to optional 3-element (major, minor, arch_suffix) format.

Key Changes

  1. codegen.py: Modified _ensure_cc() to automatically append arch suffix "a" for compute capability >= 9.0
  2. nvrtc.py: Added _verify_cc_tuple() function to validate and normalize CC tuples with arch suffixes; updated compile() and get_arch_option() to handle 3-element tuples
  3. driver.py: Updated _Linker.__init__() to handle both 2 and 3-element CC tuples when formatting architecture strings
  4. testing.py: Added skip_unless_cc_90 decorator for tests requiring CC 9.0+
  5. tests: Added tests for arch-specific compilation and configuration

Critical Issues Found

  1. Backward Compatibility Concern: Cache keys in CUDACodeLibrary (PTX, cubin, LTOIR caches) will use 3-tuple keys for CC >= 9.0, which breaks compatibility with existing serialized code libraries that use 2-tuple keys. This could cause cache misses or deserialization issues.

  2. Test Bug: test_device_arch_specific has a parameter count mismatch - the C function declares 2 parameters but the Numba declaration and call only use 1 parameter.

Recommendations

  • Add validation for CC tuple format in _verify_cc_tuple() and _Linker.__init__()
  • Consider normalizing all CC representations to 3-tuples throughout the codebase for consistency
  • Add migration/compatibility logic for serialized caches to handle both 2-tuple and 3-tuple keys
  • Fix the parameter mismatch in the test C function
  • Remove redundant len(found) == 3 check in nvrtc.py since _verify_cc_tuple() always returns 3-tuples

Confidence Score: 3/5

  • This PR introduces useful functionality but has backward compatibility risks and a test bug that need attention before merging
  • Score of 3 reflects: (1) a significant backward compatibility issue with serialized cache keys that could affect production deployments, (2) a bug in the test code with parameter mismatch that will cause runtime issues, (3) missing input validation that could lead to unclear errors, and (4) several opportunities for code cleanup. The core implementation logic appears sound, but these issues need to be addressed to ensure safe deployment.
  • Pay special attention to numba_cuda/numba/cuda/codegen.py (cache serialization compatibility) and numba_cuda/numba/cuda/tests/cudadrv/test_cuda_driver.py (parameter mismatch bug)

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| numba_cuda/numba/cuda/codegen.py | 3/5 | Modified `_ensure_cc` to return a 3-tuple (major, minor, arch) for CC >= 9.0. Potential cache key inconsistency and backward compatibility issues with serialized caches. |
| numba_cuda/numba/cuda/cudadrv/driver.py | 4/5 | Added support for 3-element cc tuples in `_Linker.__init__`. Changes look correct and handle both 2-tuple and 3-tuple formats properly. |
| numba_cuda/numba/cuda/cudadrv/nvrtc.py | 3/5 | Added `_verify_cc_tuple` function and modified `get_arch_option` to support arch-specific suffixes. Has a redundant length check and no input validation for malformed cc tuples. |
| numba_cuda/numba/cuda/tests/cudadrv/test_cuda_driver.py | 2/5 | Added tests for arch-specific targets. `test_device_arch_specific` has a C function signature mismatch: it declares 2 parameters but only 1 is used/passed. |

@greptile-apps (bot) left a comment

5 files reviewed, 5 comments


Comment on lines +405 to +416
extern "C" __device__
int set_descriptor(int *out, int* smem) {
unsigned usmem = __cvta_generic_to_shared(smem);
asm volatile("tensormap.replace.tile.rank.shared::cta.b1024.b32 [%0], 2;" :: "r"(usmem));
return 0;
}
""")

set_descriptor = cuda.declare_device(
"set_descriptor",
types.int32(types.CPointer(types.int32)),
link=[set_desc],
Contributor:

The C function set_descriptor declares two parameters (int *out, int* smem) but the Numba declaration only specifies one parameter types.CPointer(types.int32), and only one argument is passed at line 425. This is a parameter count mismatch.

Looking at the C implementation, the out parameter is never used - only smem is used. The function signature should be corrected to match the actual usage:

extern "C" __device__
int set_descriptor(int* smem) {
    // ... rest of implementation
}

Or if out is intended to be used, the Numba declaration and call site need to be updated to pass both parameters.

Contributor:

This is because of the Numba ABI.

# being optimized away.
major, minor = found
major, minor = found[0], found[1]
cc_arch = found[2] if len(found) == 3 else ""
Contributor:

The check if len(found) == 3 is redundant because _verify_cc_tuple always returns a 3-tuple (see lines 60 and 66 in the same file). Consider simplifying to:

Suggested change
cc_arch = found[2] if len(found) == 3 else ""
cc_arch = found[2]

This makes the code clearer and avoids the unnecessary conditional.


Comment on lines +33 to +46
def _verify_cc_tuple(cc):
    version = _get_nvrtc_version()
    ver_str = lambda version: ".".join(str(v) for v in version)

    if len(cc) == 3:
        cc, arch = (cc[0], cc[1]), cc[2]
    else:
        arch = ""

    if arch not in ("", "a", "f"):
        raise ValueError(
            f"Invalid architecture suffix '{arch}' in compute capability "
            f"{ver_str(cc)}{arch}. Expected '', 'a', or 'f'."
        )
Contributor:

There's no validation that cc is a tuple with length 2 or 3. If an invalid input like cc=(9,) (1-tuple) or cc=(9, 0, "a", "extra") (4-tuple) is passed, the function will fail with unclear error messages.

Consider adding validation at the start of the function:

if not isinstance(cc, tuple) or len(cc) not in (2, 3):
    raise ValueError(
        f"Compute capability must be a 2-tuple (major, minor) or "
        f"3-tuple (major, minor, arch), got {cc}"
    )

Comment on lines +2306 to +2309
if len(cc) == 3:
    arch = f"sm_{cc[0]}{cc[1]}{cc[2]}"
else:
    arch = f"sm_{cc[0]}{cc[1]}"
Contributor:

Similar to the nvrtc.py issue, there's no validation that cc is a properly formed tuple. Consider adding a check:

if not isinstance(cc, tuple) or len(cc) not in (2, 3):
    raise ValueError(f"cc must be a 2-tuple or 3-tuple, got {cc}")

This will provide clearer error messages if invalid input is passed.
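
For illustration, the formatting rule quoted above exercised on both tuple shapes (a standalone sketch, not the _Linker code):

for cc in [(8, 6), (9, 0, "a"), (10, 0, "f")]:
    if len(cc) == 3:
        arch = f"sm_{cc[0]}{cc[1]}{cc[2]}"
    else:
        arch = f"sm_{cc[0]}{cc[1]}"
    print(arch)  # sm_86, then sm_90a, then sm_100f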

@greptile-apps
Contributor

greptile-apps bot commented Jan 12, 2026

Additional Comments (1)

numba_cuda/numba/cuda/codegen.py
This change introduces a potential backward compatibility issue with serialized code libraries. The caches (_ptx_cache, _cubin_cache, _ltoir_cache) use cc tuples as keys. Before this PR, keys were always 2-tuples (major, minor). After this PR, keys can be 3-tuples (major, minor, arch).

This means:

  1. Old serialized caches with 2-tuple keys won't be found by lookups with 3-tuple keys
  2. New serialized caches with 3-tuple keys might not be compatible with older versions of the code

Additionally, there's an inconsistency: _ensure_cc() returns a 3-tuple for CC >= 9.0, but magic_tuple() at line 574 still returns a 2-tuple from device.compute_capability, which could affect cache invalidation logic.

Consider:

  • Normalizing all CC representations to always be 3-tuples throughout the codebase (with empty string for older CCs)
  • Or ensuring cache keys are normalized to a consistent format
  • Adding migration logic for old serialized caches

@gmarkall
Contributor

/ok to test

@greptile-apps (bot) left a comment

6 files reviewed, 6 comments


#include <cuda_fp16.h>

extern "C" __device__
int set_descriptor(int *out, int* smem) {
Contributor:

The C function signature has 2 parameters (int *out, int* smem) but the declaration on line 415 specifies only 1 parameter (types.CPointer(types.int32)). This is a critical mismatch that will cause undefined behavior.

The first parameter int *out is never used in the C code and should be removed. The correct C signature should be:

Suggested change
int set_descriptor(int *out, int* smem) {
int set_descriptor(int* smem) {

This makes the C signature match the Python declaration which expects a single pointer parameter.

Comment on lines +99 to +100
major, minor = found[0], found[1]
cc_arch = found[2] if len(found) == 3 else ""
Contributor:

Unnecessary defensive check: _verify_cc_tuple() always returns a 3-tuple (see lines 60 and 66), so len(found) == 3 will always be True. This check is harmless but redundant.

If you want to keep it for safety, consider simplifying to just found[2] without the conditional, or add a comment explaining why the defensive check exists.


device = devices.get_context().device
return device.compute_capability
cc = device.compute_capability
cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
Contributor:

Hardcoded logic for automatic arch suffix assignment: The code automatically assigns "a" suffix for all cc >= 9.0. However:

  1. Not all GPUs with cc >= 9.0 necessarily have an "a" variant
  2. The logic doesn't handle "f" variants (like sm_100f mentioned in tests)
  3. Users might want cc >= 9.0 without arch suffix

Consider either:

  • Making this configurable via an environment variable/config option
  • Or providing a way to explicitly opt-out of automatic arch suffix
  • Or documenting this behavior clearly so users understand that cc >= 9.0 automatically gets "a" suffix
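
A plain restatement of the automatic-suffix rule under discussion (sketch only, not the library code):

def with_arch_suffix(cc):
    # Append "a" for CC >= 9.0, mirroring the hardcoded rule above.
    return (cc[0], cc[1], "a" if cc >= (9, 0) else "")

assert with_arch_suffix((7, 5)) == (7, 5, "")
assert with_arch_suffix((9, 0)) == (9, 0, "a")
assert with_arch_suffix((12, 0)) == (12, 0, "a")  # every future CC also gets "a"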

@greptile-apps
Contributor

greptile-apps bot commented Jan 12, 2026

Additional Comments (3)

numba_cuda/numba/cuda/codegen.py
Critical inconsistency: _ensure_cc() doesn't normalize user-provided cc tuples. When cc is not None (line 208-209), it returns the tuple as-is without ensuring it has 3 elements. This means:

  1. If a user passes cc=(9, 0) (2-tuple), it's returned unchanged
  2. But when cc=None, it returns a 3-tuple with arch suffix for cc >= 9.0

This creates inconsistent cc tuple lengths throughout the codebase. All code consuming cc tuples needs to handle both 2 and 3-element tuples defensively.

Recommendation: Normalize all cc tuples to 3-element format in _ensure_cc():

def _ensure_cc(self, cc):
    if cc is not None:
        # Normalize to 3-tuple format
        if len(cc) == 2:
            cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
        return cc
    
    device = devices.get_context().device
    cc = device.compute_capability
    cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
    return cc

numba_cuda/numba/cuda/codegen.py
Cache key inconsistency: magic_tuple() returns a 2-tuple (runtime_version, cc) where cc comes directly from device.compute_capability (always a 2-tuple). However, _ensure_cc() now returns 3-tuples for cc >= 9.0.

This means:

  • Cache keys in magic_tuple() use 2-tuple format: (runtime, (9, 0))
  • But compilation caches (_ptx_cache, _cubin_cache, _ltoir_cache) use 3-tuple keys: (9, 0, "a")

This breaks cache lookups and invalidation logic. The fix should apply the same transformation as _ensure_cc():

    def magic_tuple(self):
        """
        Return a tuple unambiguously describing the codegen behaviour.
        """
        ctx = devices.get_context()
        cc = ctx.device.compute_capability
        cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
        return (runtime.runtime.get_version(), cc)

numba_cuda/numba/cuda/codegen.py
Serialization cache format compatibility issue: The caches (_ptx_cache, _cubin_cache, _ltoir_cache) are serialized and restored with their keys intact. With this PR:

  • Old serialized caches: Keys are 2-tuples like (9, 0)
  • New serialized caches: Keys are 3-tuples like (9, 0, "a")

When deserializing old caches with the new code, or vice versa, cache lookups will fail because (9, 0) != (9, 0, "a").

Consider:

  1. Adding a version number to serialized data to detect format changes
  2. Or normalizing cache keys during deserialization in _rebuild()
  3. Or documenting that this is a breaking change requiring cache invalidation

@gmarkall
Contributor

/ok to test

@greptile-apps (bot) left a comment

4 files reviewed, 4 comments


Comment on lines +413 to +417
set_descriptor = cuda.declare_device(
    "set_descriptor",
    types.int32(types.CPointer(types.int32)),
    link=[set_desc],
)
Contributor:

The function signature is incorrect. The set_descriptor function is declared to take 2 parameters (int *out, int* smem), but the C code at line 406 only declares ONE parameter (int* smem). The out parameter is never used in the C implementation.

The signature should be:

Suggested change
set_descriptor = cuda.declare_device(
    "set_descriptor",
    types.int32(types.CPointer(types.int32)),
    link=[set_desc],
)
set_descriptor = cuda.declare_device(
    "set_descriptor",
    types.int32(types.CPointer(types.int32)),
    link=[set_desc],
)

Or if two parameters are needed, the C code should be updated to actually use both parameters.

Comment on lines 349 to +352
cufunc = self._cufunc_cache.get(device.id, None)
if cufunc:
    return cufunc
cubin = self.get_cubin(cc=device.compute_capability)
cubin = self.get_cubin()
Contributor:

This call was changed from self.get_cubin(cc=device.compute_capability) to self.get_cubin(). While this change ensures the arch suffix is automatically added via _ensure_cc(), it could have subtle implications:

  1. Previously, this explicitly passed the device's CC, which was always a 2-tuple
  2. Now it relies on _ensure_cc(None) to query and augment the CC

This change is likely intentional to ensure arch-specific compilation, but it should be noted that this modifies the cache key used for cubin lookups when cc >= 9.0. Old cached cubins compiled before this PR won't be found (cache invalidation).

@greptile-apps
Contributor

greptile-apps bot commented Jan 12, 2026

Additional Comments (2)

numba_cuda/numba/cuda/codegen.py
Critical cache key inconsistency: magic_tuple() returns a 2-tuple (runtime_version, cc) where cc is from device.compute_capability (always a 2-tuple). However, _ensure_cc() now returns a 3-tuple for cc >= 9.0. This creates inconsistent cache keys used by _index_key() in the caching system (see numba_cuda/numba/cuda/core/caching.py line 345).

When cc >= 9.0:

  • Cache lookups use magic_tuple() with 2-tuple: (runtime_version, (9, 0))
  • Cache stores use _ensure_cc() with 3-tuple: (runtime_version, (9, 0, "a"))

This means cached compilations will never be found, causing unnecessary recompilation every time.

The fix is to apply the same transformation in magic_tuple():

    def magic_tuple(self):
        """
        Return a tuple unambiguously describing the codegen behaviour.
        """
        ctx = devices.get_context()
        cc = ctx.device.compute_capability
        cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
        return (runtime.runtime.get_version(), cc)

numba_cuda/numba/cuda/codegen.py
Inconsistent normalization: when cc is provided by the user (not None), it's returned as-is without normalization. But when cc=None, it returns a 3-tuple with arch suffix for cc >= 9.0. This means:

  1. User passes cc=(9, 0) → returns (9, 0) (2-tuple)
  2. User passes cc=None → returns (9, 0, "a") (3-tuple)

This creates inconsistent tuple lengths throughout the codebase. All downstream code must defensively handle both 2 and 3-element tuples.

Additionally, user-provided 3-element tuples with invalid arch suffixes are not validated here (validation only happens in nvrtc._verify_cc_tuple() later).

Recommendation: Normalize ALL cc tuples to 3-element format here:

    def _ensure_cc(self, cc):
        if cc is not None:
            # Normalize user-provided cc to 3-tuple format
            if len(cc) == 2:
                cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
            elif len(cc) == 3:
                # Validate arch suffix
                if cc[2] not in ("", "a", "f"):
                    raise ValueError(
                        f"Invalid architecture suffix '{cc[2]}' in compute "
                        f"capability {cc[0]}.{cc[1]}{cc[2]}. Expected '', 'a', or 'f'."
                    )
            else:
                raise ValueError(f"Invalid cc tuple length: {len(cc)}")
            return cc

        device = devices.get_context().device
        cc = device.compute_capability
        cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
        return cc

@gmarkall
Contributor

/ok to test

@gmarkall
Contributor

/ok to test

@greptile-apps (bot) left a comment

5 files reviewed, 5 comments


Comment on lines +59 to +63
if found != cc:
    found = (found[0], found[1], arch)
    warnings.warn(
        f"Device compute capability {ver_str(cc)} is not supported by "
        f"NVRTC {ver_str(version)}. Using {ver_str(found)} instead."
Contributor:

Misleading warning message. When found != cc (line 59), it means NVRTC doesn't support the exact CC version, so it falls back to a lower supported version. However, the warning message on lines 62-63 says "Device compute capability ... is not supported" which is confusing because:

  1. It's actually the NVRTC toolkit that doesn't support it, not the device
  2. The warning talks about "Using ... instead" but the arch suffix is preserved from the input, which might not make sense for a downgraded CC

For example, if a user requests CC 9.5 with "a" suffix on a toolkit that only supports up to 9.0, the code will return (9, 0, "a"). But the warning message is misleading about what's happening.

Suggestion: Make the warning more accurate:

warnings.warn(
    f"Compute capability {ver_str(cc)}{arch} is not supported by "
    f"NVRTC {ver_str(version)}. Downgrading to {ver_str(found)} "
    f"(closest supported version) with arch suffix '{arch}'."
)

Comment on lines +37 to +40
if len(cc) == 3:
    cc, arch = (cc[0], cc[1]), cc[2]
else:
    arch = ""
Contributor:

Variable shadowing issue: Line 38 reassigns cc from a 3-tuple to a 2-tuple (cc[0], cc[1]), extracting the arch suffix separately. This shadows the input parameter cc and makes the code harder to follow.

Additionally, this creates a potential bug on line 50 where cc (now the 2-tuple) is used in the filter comparison, but the original 3-tuple meaning is lost. While this works correctly because we're only comparing major/minor versions, the code would be clearer with distinct variable names.

Suggestion: Use distinct variable names:

if len(cc) == 3:
    cc_base = (cc[0], cc[1])
    arch = cc[2]
else:
    cc_base = cc
    arch = ""

# Then use cc_base throughout for comparisons
found = max(filter(lambda v: v <= cc_base, [v for v in supported_ccs]))

Comment on lines +213 to +235
def get_arch_option(major, minor, arch=""):
"""Matches with the closest architecture option"""
if config.FORCE_CUDA_CC:
arch = config.FORCE_CUDA_CC
fcc = config.FORCE_CUDA_CC
major, minor = fcc[0], fcc[1]
if len(fcc) == 3:
arch = fcc[2]
else:
arch = ""
else:
arch = find_closest_arch((major, minor))
return "compute_%d%d" % arch
new_major, new_minor = find_closest_arch((major, minor))
if (new_major, new_minor) != (major, minor):
# If we picked a different major / minor, then using an
# arch-specific version is invalid
if arch != "":
raise ValueError(
f"Can't use arch-specific compute_{major}{minor}{arch} with "
"closest found compute capability "
f"compute_{new_major}{new_minor}"
)
major, minor = new_major, new_minor

return f"compute_{major}{minor}{arch}"
Contributor:

Logic inconsistency: When config.FORCE_CUDA_CC is set (lines 216-221), the function extracts the arch suffix from fcc and assigns it to the arch parameter. However, this overwrites the arch parameter that was passed into the function, which could be unexpected behavior.

Example problematic scenario:

# User calls: get_arch_option(9, 0, "f")  
# With FORCE_CUDA_CC = (9, 0, "a")
# Result: returns "compute_90a" (ignoring the "f" that was passed in)

The function signature suggests arch="" is a parameter the caller can control, but it gets overwritten when FORCE_CUDA_CC is set.

Recommendation: Either:

  1. Document that arch parameter is ignored when FORCE_CUDA_CC is set, OR
  2. Respect the passed-in arch parameter and only use the major/minor from FORCE_CUDA_CC

Based on the test at line 443-446 which expects the FORCE_CUDA_CC arch to be used, option 1 seems intended, so add documentation.

@greptile-apps
Contributor

greptile-apps bot commented Jan 12, 2026

Additional Comments (2)

numba_cuda/numba/cuda/memory_management/nrt.py
For devices with CC >= 9.0, this code path will pass a 2-tuple to _Linker, which means it won't automatically get the "a" suffix that is added in _ensure_cc() in codegen.py. This creates an inconsistency where:

  • Code compiled through CUDACodeLibrary (e.g., JIT kernels) gets sm_90a for CC 9.0 devices
  • NRT memsys.cu compiled here gets sm_90 without the suffix

This means NRT won't benefit from arch-specific features on CC >= 9.0 devices, and could potentially cause compatibility issues when linking together.

Fix: Apply the same transformation as in _ensure_cc():

cc = get_current_device().compute_capability
cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")

numba_cuda/numba/cuda/codegen.py
Critical normalization bug: When cc is not None (line 208-209), this function returns the user-provided tuple as-is without normalizing it to 3 elements. This creates inconsistent behavior:

Scenario 1: User calls with cc=(9, 0) (2-tuple)

  • Returns (9, 0) → 2-tuple

Scenario 2: User calls with cc=None on CC 9.0 device

  • Returns (9, 0, "a") → 3-tuple

This inconsistency causes multiple problems:

  1. Cache keys have different tuple lengths for the same logical CC
  2. Code that uses *cc to unpack will fail with 2-tuples when arch suffix is expected
  3. Comparison operations cc >= (9, 0) behave differently

Fix: Normalize ALL cc tuples consistently:

def _ensure_cc(self, cc):
    if cc is not None:
        # Normalize user-provided tuple to 3 elements
        if len(cc) == 2:
            cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
        return cc
    
    device = devices.get_context().device
    cc = device.compute_capability
    cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
    return cc

@gmarkall gmarkall merged commit a1c8294 into NVIDIA:main Jan 12, 2026
177 of 178 checks passed
gmarkall added a commit to gmarkall/numba-cuda that referenced this pull request Jan 12, 2026
- Add arch specific target support (NVIDIA#549)
- chore: disable `locked` flag to bypass prefix-dev/pixi#5256 (NVIDIA#714)
- ci: relock pixi (NVIDIA#712)
- ci: remove redundant conda build in ci (NVIDIA#711)
- chore(deps): bump numba-cuda version and relock pixi (NVIDIA#707)
- Dropping bits in the old CI & Propagating recent changes from cuda-python (NVIDIA#683)
- Fix `test_wheel_deps_wheels.sh` to actually uninstall `nvvm` and `nvrtc` packages for CUDA 13 (NVIDIA#701)
- perf: remove some exception control flow and buffer-exception penalization for arrays (NVIDIA#700)
- perf: let CAI fall through instead of calling from_cuda_array_interface (NVIDIA#694)
- chore: perf lint (NVIDIA#697)
- chore(deps): bump deps in pixi lockfile (NVIDIA#693)
- fix: use freethreading-supported `_PySet_NextItemRef` where possible (NVIDIA#682)
- Support python `3.14` (NVIDIA#599)
- Remove customized address space tracking and address class emission in debug info (NVIDIA#669)
- Drop `experimental` from cuda.core namespace imports (NVIDIA#676)
- Remove dangling references to NUMBA_CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY (NVIDIA#675)
- Use `rapidsai/sccache` in CI (NVIDIA#674)
- chore(dev-deps): remove ipython and pyinstrument (NVIDIA#670)
- Set up a new VM-based CI infrastructure  (NVIDIA#604)
@gmarkall gmarkall mentioned this pull request Jan 12, 2026
gmarkall added a commit that referenced this pull request Jan 12, 2026
- Add arch specific target support (#549)
- chore: disable `locked` flag to bypass prefix-dev/pixi#5256 (#714)
- ci: relock pixi (#712)
- ci: remove redundant conda build in ci (#711)
- chore(deps): bump numba-cuda version and relock pixi (#707)
- Dropping bits in the old CI & Propagating recent changes from cuda-python (#683)
- Fix `test_wheel_deps_wheels.sh` to actually uninstall `nvvm` and `nvrtc` packages for CUDA 13 (#701)
- perf: remove some exception control flow and buffer-exception penalization for arrays (#700)
- perf: let CAI fall through instead of calling from_cuda_array_interface (#694)
- chore: perf lint (#697)
- chore(deps): bump deps in pixi lockfile (#693)
- fix: use freethreading-supported `_PySet_NextItemRef` where possible (#682)
- Support python `3.14` (#599)
- Remove customized address space tracking and address class emission in debug info (#669)
- Drop `experimental` from cuda.core namespace imports (#676)
- Remove dangling references to NUMBA_CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY (#675)
- Use `rapidsai/sccache` in CI (#674)
- chore(dev-deps): remove ipython and pyinstrument (#670)
- Set up a new VM-based CI infrastructure (#604)

Labels

2 - In Progress (currently a work in progress), P0 (blocking external stakeholders)


Development

Successfully merging this pull request may close these issues.

[FEA] Add support to architecture/family specific features

3 participants