
Add arch specific target support#549

Merged
gmarkall merged 17 commits into NVIDIA:main from ZzEeKkAa:yhavrylko/feature/arch_specific_target
Jan 12, 2026

Conversation

@ZzEeKkAa
Contributor

WIP! Supposed to work; some tests still need to be added.

Closes: #469

@ZzEeKkAa ZzEeKkAa requested a review from gmarkall October 24, 2025 19:11
@ZzEeKkAa ZzEeKkAa self-assigned this Oct 24, 2025
@copy-pr-bot

copy-pr-bot bot commented Oct 24, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@gmarkall gmarkall added the "2 - In Progress" (currently a work in progress) label Oct 27, 2025
@rparolin rparolin added the "P0" (blocking external stakeholders) label Nov 20, 2025
v0.23.0

- Capture global device arrays in kernels and device functions (NVIDIA#666)
- Fix NVIDIA#624: Accept Numba IR nodes in all places Numba-CUDA IR nodes are expected (NVIDIA#643)
- Fix Issue NVIDIA#588: separate compilation of NVVM IR modules when generating debuginfo (NVIDIA#591)
- feat: allow printing nested tuples (NVIDIA#667)
- build(deps): bump actions/setup-python from 5.6.0 to 6.1.0 (NVIDIA#655)
- build(deps): bump actions/upload-artifact from 4 to 5 (NVIDIA#652)
- Test RAPIDS 25.12 (NVIDIA#661)
- Do not manually set DUMP_ASSEMBLY in `nvjitlink` tests (NVIDIA#662)
- feat: add print support for int64 tuples (NVIDIA#663)
- Only run dependabot monthly and open fewer PRs (NVIDIA#658)
- test: fix bogus `self` argument to `Context` (NVIDIA#656)
- Fix false negative NRT link decision when NRT was previously toggled on (NVIDIA#650)
- Add support for dependabot (NVIDIA#647)
- refactor: cull dead linker objects (NVIDIA#649)
- Migrate numba-cuda driver to use cuda.core.launch API (NVIDIA#609)
- feat: add set_shared_memory_carveout (NVIDIA#629)
- chore: bump version in pixi.toml (NVIDIA#641)
- refactor: remove devicearray code to reduce complexity (NVIDIA#600)
@ZzEeKkAa ZzEeKkAa marked this pull request as ready for review January 8, 2026 16:57
@copy-pr-bot

copy-pr-bot bot commented Jan 8, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@ZzEeKkAa
Contributor Author

ZzEeKkAa commented Jan 8, 2026

/ok to test

@gmarkall
Contributor

gmarkall commented Jan 8, 2026

/ok to test

@greptile-apps (bot) left a comment

Greptile Overview

Greptile Summary

This PR adds support for architecture-specific CUDA compute capability targets (e.g., sm_90a) to enable Hopper-specific features like Tensor Memory Accelerator (TMA) instructions. The changes extend the compute capability tuple from 2 elements (major, minor) to an optional 3 elements (major, minor, arch_suffix), where the suffix can be "a" (accelerated) or "f" (full).

Key changes:

  • Extended _ensure_cc in codegen.py to automatically add "a" suffix for CC >= 9.0
  • Added _verify_cc_tuple validation function in nvrtc.py to handle and validate arch suffixes
  • Updated get_arch_option to accept and propagate the arch suffix parameter
  • Modified _Linker to support 3-element CC tuples for proper target generation
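
For orientation, a small sketch of the extended tuple format these changes introduce (values are illustrative):

cc_plain = (8, 6)         # classic 2-tuple         -> sm_86 / compute_86
cc_accel = (9, 0, "a")    # "accelerated" variant   -> sm_90a (Hopper features such as TMA)
cc_full = (10, 0, "f")    # "full" (family) variant -> sm_100f

for cc in (cc_plain, cc_accel, cc_full):
    suffix = cc[2] if len(cc) == 3 else ""
    print(f"sm_{cc[0]}{cc[1]}{suffix}")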

Issues found:

  • Critical: The test function signature in test_cuda_driver.py has a parameter mismatch - the C function declares 2 parameters but declare_device only specifies 1, which will cause runtime errors
  • Minor: Outdated docstring parameter name in find_closest_arch

Confidence Score: 2/5

  • This PR has a critical test bug that will cause runtime failures
  • The core implementation for arch-specific targets appears sound, but the test has a critical signature mismatch that will cause failures. The C function expects 2 parameters but is declared with only 1, making the test non-functional.
  • Pay close attention to numba_cuda/numba/cuda/tests/cudadrv/test_cuda_driver.py - the function signature mismatch must be fixed before merging

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| numba_cuda/numba/cuda/codegen.py | 4/5 | Modified `_ensure_cc` to add the arch suffix ("a") for CC >= 9.0 and removed the explicit CC parameter from the `get_cubin` call |
| numba_cuda/numba/cuda/cudadrv/driver.py | 5/5 | Added support for 3-element CC tuples in `_Linker.__init__` to handle arch-specific targets |
| numba_cuda/numba/cuda/cudadrv/nvrtc.py | 3/5 | Added `_verify_cc_tuple` validation function and updated `get_arch_option` to support arch suffixes; docstring parameter name is outdated |
| numba_cuda/numba/cuda/testing.py | 5/5 | Added `skip_unless_cc_90` test decorator for CC 9.0+ tests |
| numba_cuda/numba/cuda/tests/cudadrv/test_cuda_driver.py | 2/5 | Added test for arch-specific features; signature mismatch between the C function (2 params) and `declare_device` (1 param) will cause a runtime error |

""")

set_descriptor = cuda.declare_device(
"set_descriptor",
Contributor:

The `types.int32(types.CPointer(types.int32))` signature doesn't match the C function, which takes two parameters (`int *out`, `int *smem`), but `declare_device` is only passed one parameter type.

Contributor:

This is not an error: the Numba calling convention specifies that the return value is stored into the location pointed to by the first argument.
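
A minimal sketch of that convention, assuming the declaration from the test under review (the C signature is shown as a comment):

from numba import cuda, types

# The ONE-parameter declaration on the Python side...
set_descriptor = cuda.declare_device(
    "set_descriptor", types.int32(types.CPointer(types.int32))
)

# ...binds to a C symbol with TWO parameters: Numba prepends a pointer
# through which the Python-level return value is written, and the C
# function's own int return value serves as a status code:
#
#     extern "C" __device__ int set_descriptor(int *out, int *smem);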

@greptile-apps
Contributor

greptile-apps bot commented Jan 8, 2026

Additional Comments (1)

numba_cuda/numba/cuda/cudadrv/nvrtc.py
Docstring parameter name is outdated

    :param cc: Compute capability as a tuple ``(MAJOR, MINOR)``

@gmarkall
Contributor

gmarkall commented Jan 8, 2026

/ok to test

@greptile-apps (bot) left a comment

Greptile Overview

Greptile Summary

Adds architecture-specific target support (sm_90a, sm_90f) by extending compute capability tuples from a 2-element to an optional 3-element format. Modifies the compilation pipeline to automatically append the "a" suffix for CC 9.0+ devices and validates architecture suffixes in NVRTC compilation. Updates the linker to handle the new tuple format when generating architecture strings.

Confidence Score: 1/5

  • Contains critical backward compatibility issues and runtime errors
  • Two critical issues block merge: (1) codegen.py modifies device.compute_capability from 2-tuple to 3-tuple, breaking comparisons throughout the codebase where 2-tuples are expected (testing.py:282, driver.py:591, driver.py:622, and multiple test files); (2) nvrtc.py:202 has tuple concatenation bug causing immediate runtime crash. Additionally, there are concerns about cache key consistency and arch suffix handling with FORCE_CUDA_CC.
  • numba_cuda/numba/cuda/codegen.py and numba_cuda/numba/cuda/cudadrv/nvrtc.py require immediate fixes before merge

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| numba_cuda/numba/cuda/codegen.py | 1/5 | Modifies compute_capability from a 2-tuple to a 3-tuple; breaks existing comparisons throughout the codebase |
| numba_cuda/numba/cuda/cudadrv/nvrtc.py | 1/5 | Contains a tuple concatenation bug that will cause a runtime error; arch suffix validation added correctly |
| numba_cuda/numba/cuda/cudadrv/driver.py | 3/5 | Updates the linker to handle an optional 3rd arch element in the cc tuple; implementation looks correct |

Comment on lines +212 to +214
cc = device.compute_capability
cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
return cc
Contributor:

This change breaks backward compatibility by modifying compute_capability from a 2-tuple to a 3-tuple. Throughout the codebase, device.compute_capability is expected to be a 2-tuple (major, minor) and is compared with 2-tuples:

  • testing.py:282: cc >= (major, minor) where cc = device.compute_capability
  • driver.py:591: self.compute_capability < MIN_REQUIRED_CC
  • driver.py:622: self.compute_capability >= (5, 3)
  • Multiple test files perform similar comparisons

After line 213 executes, cc becomes a 3-tuple, which makes tuple comparisons subtle. For example, (9, 0, "a") >= (9, 0) evaluates to True in Python, but (9, 0) >= (9, 0, "a") evaluates to False, because a tuple that is a strict prefix of another compares as smaller.

The modification should only happen locally within this method and not affect the device's compute_capability property. Consider storing the arch-specific version separately or ensure all comparison sites are updated to handle 3-tuples.

Contributor:

The device's compute capability is not being affected.
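
For reference, the tuple-ordering rules behind this exchange, as plain Python:

# Equal prefixes: the longer tuple compares greater.
assert (9, 0, "a") >= (9, 0)
assert not (9, 0) >= (9, 0, "a")

# An earlier element decides before the suffix is ever compared, so no
# string-vs-int comparison occurs here.
assert not (8, 0, "") >= (9, 0)

# Strings in the third slot only meet each other between two 3-tuples.
assert (9, 0, "f") > (9, 0, "a")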

Comment on lines 200 to 203
msg = (
    "GPU compute capability %d.%d is not supported"
    "(requires >=%d.%d)" % (mycc + cc)
    "(requires >=%d.%d)" % (cc + supported_cc)
)
Contributor:

Tuple concatenation syntax error will cause runtime crash. The expression cc + supported_cc attempts to concatenate two tuples for use in string formatting with %, but the format string expects 4 separate values, not a 4-tuple.

This should be:

Suggested change
msg = (
    "GPU compute capability %d.%d is not supported"
    "(requires >=%d.%d)" % (mycc + cc)
    "(requires >=%d.%d)" % (cc + supported_cc)
)
msg = (
    "GPU compute capability %d.%d is not supported"
    "(requires >=%d.%d)" % (cc[0], cc[1], supported_cc[0], supported_cc[1])
)

Note: This bug exists in both the original and modified code at this location. The refactoring changed the parameter name from mycc to cc but didn't fix the pre-existing bug.

Contributor:

I think this is correct, but fixing it is low-priority and should be done outside this PR.

Contributor:

Upon re-reading the comment, I think it is bogus.
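
A quick check of why: adjacent string literals are concatenated at parse time, before % is applied, so the concatenated 4-tuple matches the four %d specifiers (illustrative values):

cc = (7, 0)
supported_cc = (9, 0)
msg = (
    "GPU compute capability %d.%d is not supported"
    "(requires >=%d.%d)" % (cc + supported_cc)
)
assert msg == "GPU compute capability 7.0 is not supported(requires >=9.0)"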

if cufunc:
    return cufunc
cubin = self.get_cubin(cc=device.compute_capability)
cubin = self.get_cubin()
Contributor:

The removal of the explicit cc=device.compute_capability parameter may cause issues. The get_cubin() method will call _ensure_cc(None), which will fetch device.compute_capability and convert it to a 3-tuple with the arch suffix.

However, if the device's compute capability is cached elsewhere or accessed later, there's an inconsistency: the device object still reports a 2-tuple, but compilation uses a 3-tuple. Verify that cached cubins are properly keyed by the 3-tuple to avoid cache misses or incorrect binary selection.

Contributor:

The cc argument was removed here because it was the same as the default that got used in _ensure_cc anyway. So this removal saved making the same change in two different places.

Comment on lines 215 to 218
if config.FORCE_CUDA_CC:
    arch = config.FORCE_CUDA_CC
    major, minor = config.FORCE_CUDA_CC
else:
    arch = find_closest_arch((major, minor))
return "compute_%d%d" % arch
major, minor = find_closest_arch((major, minor))
Contributor:

When config.FORCE_CUDA_CC is set, the arch parameter is discarded. If a user explicitly passes a 3-tuple with an arch suffix (e.g., (9, 0, "a")), the arch suffix will be lost when FORCE_CUDA_CC overrides the major/minor values.

Consider preserving the arch suffix:

Suggested change
if config.FORCE_CUDA_CC:
    arch = config.FORCE_CUDA_CC
    major, minor = config.FORCE_CUDA_CC
else:
    arch = find_closest_arch((major, minor))
return "compute_%d%d" % arch
major, minor = find_closest_arch((major, minor))
if config.FORCE_CUDA_CC:
    major, minor = config.FORCE_CUDA_CC
    # arch parameter is preserved from the function argument
else:

This ensures that if someone calls get_arch_option(9, 0, "a") with FORCE_CUDA_CC=(9, 0), they still get "compute_90a" instead of "compute_90". Should the arch suffix be preserved when FORCE_CUDA_CC is set, or is it intentional to reset it?

Contributor:

I think arch should be discarded when FORCE_CUDA_CC is set but it isn't here. This should be fixed, but not in the way suggested by greptile.

Comment on lines +412 to +415
set_descriptor = cuda.declare_device(
    "set_descriptor",
    types.int32(types.CPointer(types.int32)),
    link=[set_desc],
Contributor:

The set_descriptor function has an unused out parameter and doesn't actually set any descriptor—it only executes a tensormap.replace instruction. The function signature and implementation don't match the apparent intent.

If this is intentional for testing that the architecture-specific instruction compiles and runs, consider renaming the function to reflect what it actually does, or remove the unused parameter.

Contributor:

It's intentional for testing.

Contributor:

Also, this is updating a TMA descriptor, so the name seems appropriate.

When `config.FORCE_CUDA_CC` is set, the arch option from it should be
used regardless of what is passed into `get_arch_option()`.

Also, if the discovered CC is not an exact match for the requested one,
it is not valid to use arch-specific versions because the arch-specific
PTX will not be compatible with the device being targeted.
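
A plain-Python restatement of that rule (a sketch with hypothetical names, not the library code):

def resolve_arch(requested, closest, suffix=""):
    # requested/closest are (major, minor) pairs; suffix is "", "a" or "f".
    if closest != requested and suffix:
        # Arch-specific PTX (e.g. compute_90a) only runs on that exact
        # architecture, so a downgraded CC cannot keep the suffix.
        raise ValueError(
            f"Can't use arch-specific compute_{requested[0]}{requested[1]}"
            f"{suffix} with closest found compute capability "
            f"compute_{closest[0]}{closest[1]}"
        )
    return f"compute_{closest[0]}{closest[1]}{suffix}"

assert resolve_arch((9, 0), (9, 0), "a") == "compute_90a"
assert resolve_arch((10, 3), (9, 0)) == "compute_90"  # downgrade without suffix is fine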
@gmarkall
Contributor

/ok to test

@greptile-apps
Contributor

greptile-apps bot commented Jan 12, 2026

Greptile Overview

Greptile Summary

This PR adds support for architecture-specific CUDA compute capability targets (e.g., sm_90a, sm_100f) by extending the compute capability tuple format from 2-element (major, minor) to optional 3-element (major, minor, arch_suffix). The implementation automatically appends arch suffixes like "a" for CC >= 9.0 to enable arch-specific CUDA features such as Tensor Memory Accelerator (TMA) instructions.

Implementation Overview

Core Changes:

  • codegen.py: Modified _ensure_cc() to auto-append "a" suffix for CC >= 9.0
  • nvrtc.py: Added _verify_cc_tuple() for validation; updated compile() and get_arch_option() to handle 3-element tuples
  • driver.py: Updated _Linker.__init__() to construct arch strings from both 2 and 3-element tuples
  • testing.py: Added skip_unless_cc_90 decorator for tests requiring CC 9.0+
  • Tests: Added comprehensive tests for arch-specific compilation including TMA instruction usage

Critical Blocking Issues

1. Cache Key Inconsistency (CRITICAL - Already Noted)

_ensure_cc() returns 3-tuple (major, minor, arch) for CC >= 9.0 when cc=None, but returns user-provided tuples as-is when cc is not None. This creates:

  • Mixed 2-tuple and 3-tuple cache keys for the same logical compute capability
  • Cache lookups will fail to find previously compiled code
  • Unpacking operations using *cc will fail with 2-tuples when 3 elements are expected (see the sketch below)
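
A toy illustration of that unpacking hazard:

major, minor, arch = (9, 0, "a")   # fine for a 3-tuple
try:
    major, minor, arch = (9, 0)    # a 2-tuple from an un-normalized path
except ValueError as exc:
    print(exc)  # not enough values to unpack (expected 3, got 2)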

2. magic_tuple() Mismatch (CRITICAL - Already Noted)

magic_tuple() at line 574 returns a 2-tuple from device.compute_capability, but _ensure_cc() returns a 3-tuple for CC >= 9.0. Since magic_tuple() is used by the caching system (via _index_key() in caching.py), this causes cache keys to be inconsistent:

  • Cache stores with key using 3-tuple from _ensure_cc()
  • Cache lookups use key with 2-tuple from magic_tuple()
  • Result: Every compilation with CC >= 9.0 will miss the cache and recompile

3. NRT Module Inconsistency (NEW CRITICAL ISSUE)

The nrt.py file at line 127 gets cc from get_current_device().compute_capability (2-tuple) and passes it directly to _Linker without applying the arch suffix transformation. For CC >= 9.0 devices:

  • JIT kernels compile with sm_90a (via _ensure_cc())
  • NRT memsys.cu compiles with sm_90 (no suffix)
  • This creates potential linking incompatibilities and means NRT doesn't benefit from arch-specific features

4. Backward Compatibility Risk

Serialized code libraries (pickle) use cc tuples as cache keys. Pre-PR caches have 2-tuple keys, post-PR can have 3-tuple keys. This breaks cache compatibility across versions for CC >= 9.0 devices.

Required Fixes

  1. Normalize ALL cc tuples: _ensure_cc() must normalize both user-provided and device-derived tuples to 3-element format
  2. Fix magic_tuple(): Apply the same transformation to return 3-tuple for CC >= 9.0
  3. Fix nrt.py: Apply arch suffix transformation before passing to _Linker
  4. Consider migration strategy: Document or implement cache invalidation for the tuple format change

The PR's goal is sound and tests are well-designed, but the implementation has critical bugs that will cause compilation cache misses and inconsistent behavior across the codebase.

Confidence Score: 1/5

  • This PR contains critical bugs that will cause cache mismatches, compilation inconsistencies, and potential runtime failures
  • Score of 1 reflects multiple critical logical errors: (1) _ensure_cc() doesn't normalize user-provided tuples causing inconsistent cache keys and unpacking failures, (2) magic_tuple() returns 2-tuple while _ensure_cc() returns 3-tuple for CC >= 9.0 causing cache misses on every compilation, (3) nrt.py won't get arch suffix for CC >= 9.0 devices creating linking incompatibilities. These are blocking issues that break core functionality.
  • Critical attention needed: codegen.py (_ensure_cc and magic_tuple methods), nrt.py (compute capability handling). Also review nvrtc.py for clarity improvements.

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| numba_cuda/numba/cuda/codegen.py | 1/5 | Critical: `_ensure_cc()` doesn't normalize user-provided cc tuples, creating cache inconsistencies; `magic_tuple()` returns a 2-tuple, causing cache mismatches; `get_cubin()` call simplified but needs validation |
| numba_cuda/numba/cuda/cudadrv/nvrtc.py | 3/5 | Added `_verify_cc_tuple()` for validation; variable shadowing in `_verify_cc_tuple`; misleading warning messages; `get_arch_option()` has a parameter overwrite issue with FORCE_CUDA_CC |
| numba_cuda/numba/cuda/cudadrv/driver.py | 4/5 | `_Linker.__init__()` updated to handle both 2- and 3-element cc tuples for arch string formatting; implementation looks correct and handles backward compatibility well |
| numba_cuda/numba/cuda/memory_management/nrt.py | 2/5 | Not changed in PR but affected: uses device.compute_capability (2-tuple) directly without the arch suffix transformation, causing inconsistency with codegen.py for CC >= 9.0 devices |

@greptile-apps (bot) left a comment

3 files reviewed, 3 comments


Comment on lines +413 to +416
set_descriptor = cuda.declare_device(
    "set_descriptor",
    types.int32(types.CPointer(types.int32)),
    link=[set_desc],
Contributor:

[P0] Function signature mismatch causes runtime error. The C function set_descriptor takes 2 parameters (int *out, int* smem), but the declared signature only specifies 1 parameter. Change to types.int32(types.CPointer(types.int32), types.CPointer(types.int32)).

def kernel(a):
    sm = cuda.shared.array(1, dtype=np.int32)
    data_ptr = ffi.from_buffer(sm)
    set_descriptor(data_ptr)
Contributor:

[P0] Function call missing required parameter. set_descriptor expects 2 arguments but only receives 1. Should be set_descriptor(data_ptr, data_ptr) or similar based on the intended semantics.

@greptile-apps
Contributor

greptile-apps bot commented Jan 12, 2026

Additional Comments (1)

numba_cuda/numba/cuda/codegen.py
[P1] Cache key inconsistency. This returns a 2-tuple (major, minor) from device.compute_capability, but _ensure_cc now returns a 3-tuple for cc >= 9.0. This creates inconsistent cache keys. Should apply the same transformation: cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "").
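
A toy demonstration of the mismatch, with a plain dictionary standing in for the real caches:

cubin_cache = {}
store_key = (9, 0, "a")  # key shape produced via _ensure_cc()
lookup_key = (9, 0)      # key shape produced via magic_tuple()

cubin_cache[store_key] = b"<cubin>"
assert cubin_cache.get(lookup_key) is None  # every lookup misses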

@greptile-apps
Contributor

greptile-apps bot commented Jan 12, 2026

Greptile Overview

Greptile Summary

Overview

This PR adds support for architecture-specific CUDA compute capability targets (e.g., sm_90a, sm_100f) to enable use of arch-specific features in CUDA kernels. The implementation extends compute capability tuples from 2-element (major, minor) to optional 3-element (major, minor, arch_suffix) format.

Key Changes

  1. codegen.py: Modified _ensure_cc() to automatically append arch suffix "a" for compute capability >= 9.0
  2. nvrtc.py: Added _verify_cc_tuple() function to validate and normalize CC tuples with arch suffixes; updated compile() and get_arch_option() to handle 3-element tuples
  3. driver.py: Updated _Linker.__init__() to handle both 2 and 3-element CC tuples when formatting architecture strings
  4. testing.py: Added skip_unless_cc_90 decorator for tests requiring CC 9.0+
  5. tests: Added tests for arch-specific compilation and configuration

Critical Issues Found

  1. Backward Compatibility Concern: Cache keys in CUDACodeLibrary (PTX, cubin, LTOIR caches) will use 3-tuple keys for CC >= 9.0, which breaks compatibility with existing serialized code libraries that use 2-tuple keys. This could cause cache misses or deserialization issues.

  2. Test Bug: test_device_arch_specific has a parameter count mismatch - the C function declares 2 parameters but the Numba declaration and call only use 1 parameter.

Recommendations

  • Add validation for CC tuple format in _verify_cc_tuple() and _Linker.__init__()
  • Consider normalizing all CC representations to 3-tuples throughout the codebase for consistency
  • Add migration/compatibility logic for serialized caches to handle both 2-tuple and 3-tuple keys
  • Fix the parameter mismatch in the test C function
  • Remove redundant len(found) == 3 check in nvrtc.py since _verify_cc_tuple() always returns 3-tuples

Confidence Score: 3/5

  • This PR introduces useful functionality but has backward compatibility risks and a test bug that need attention before merging
  • Score of 3 reflects: (1) a significant backward compatibility issue with serialized cache keys that could affect production deployments, (2) a bug in the test code with parameter mismatch that will cause runtime issues, (3) missing input validation that could lead to unclear errors, and (4) several opportunities for code cleanup. The core implementation logic appears sound, but these issues need to be addressed to ensure safe deployment.
  • Pay special attention to numba_cuda/numba/cuda/codegen.py (cache serialization compatibility) and numba_cuda/numba/cuda/tests/cudadrv/test_cuda_driver.py (parameter mismatch bug)

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| numba_cuda/numba/cuda/codegen.py | 3/5 | Modified `_ensure_cc` to return a 3-tuple (major, minor, arch) for CC >= 9.0. Potential cache key inconsistency and backward compatibility issues with serialized caches. |
| numba_cuda/numba/cuda/cudadrv/driver.py | 4/5 | Added support for 3-element cc tuples in `_Linker.__init__`. Changes look correct and handle both 2-tuple and 3-tuple formats properly. |
| numba_cuda/numba/cuda/cudadrv/nvrtc.py | 3/5 | Added `_verify_cc_tuple` function and modified `get_arch_option` to support arch-specific suffixes. Has a redundant length check and no input validation for malformed cc tuples. |
| numba_cuda/numba/cuda/tests/cudadrv/test_cuda_driver.py | 2/5 | Added tests for arch-specific targets. `test_device_arch_specific` has a C function signature mismatch: it declares 2 parameters but only 1 is used/passed. |

@greptile-apps (bot) left a comment

5 files reviewed, 5 comments


Comment on lines +405 to +416
extern "C" __device__
int set_descriptor(int *out, int* smem) {
unsigned usmem = __cvta_generic_to_shared(smem);
asm volatile("tensormap.replace.tile.rank.shared::cta.b1024.b32 [%0], 2;" :: "r"(usmem));
return 0;
}
""")

set_descriptor = cuda.declare_device(
"set_descriptor",
types.int32(types.CPointer(types.int32)),
link=[set_desc],
Contributor:

The C function set_descriptor declares two parameters (int *out, int* smem) but the Numba declaration only specifies one parameter types.CPointer(types.int32), and only one argument is passed at line 425. This is a parameter count mismatch.

Looking at the C implementation, the out parameter is never used - only smem is used. The function signature should be corrected to match the actual usage:

extern "C" __device__
int set_descriptor(int* smem) {
    // ... rest of implementation
}

Or if out is intended to be used, the Numba declaration and call site need to be updated to pass both parameters.

Contributor:

This is because of the Numba ABI.

# being optimized away.
major, minor = found
major, minor = found[0], found[1]
cc_arch = found[2] if len(found) == 3 else ""
Contributor:

The check if len(found) == 3 is redundant because _verify_cc_tuple always returns a 3-tuple (see lines 60 and 66 in the same file). Consider simplifying to:

Suggested change
cc_arch = found[2] if len(found) == 3 else ""
cc_arch = found[2]

This makes the code clearer and avoids the unnecessary conditional.


Comment on lines +33 to +46
def _verify_cc_tuple(cc):
    version = _get_nvrtc_version()
    ver_str = lambda version: ".".join(str(v) for v in version)

    if len(cc) == 3:
        cc, arch = (cc[0], cc[1]), cc[2]
    else:
        arch = ""

    if arch not in ("", "a", "f"):
        raise ValueError(
            f"Invalid architecture suffix '{arch}' in compute capability "
            f"{ver_str(cc)}{arch}. Expected '', 'a', or 'f'."
        )
Contributor:

There's no validation that cc is a tuple with length 2 or 3. If an invalid input like cc=(9,) (1-tuple) or cc=(9, 0, "a", "extra") (4-tuple) is passed, the function will fail with unclear error messages.

Consider adding validation at the start of the function:

if not isinstance(cc, tuple) or len(cc) not in (2, 3):
    raise ValueError(
        f"Compute capability must be a 2-tuple (major, minor) or "
        f"3-tuple (major, minor, arch), got {cc}"
    )

Comment on lines +2306 to +2309
if len(cc) == 3:
    arch = f"sm_{cc[0]}{cc[1]}{cc[2]}"
else:
    arch = f"sm_{cc[0]}{cc[1]}"
Contributor:

Similar to the nvrtc.py issue, there's no validation that cc is a properly formed tuple. Consider adding a check:

if not isinstance(cc, tuple) or len(cc) not in (2, 3):
    raise ValueError(f"cc must be a 2-tuple or 3-tuple, got {cc}")

This will provide clearer error messages if invalid input is passed.
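
For illustration, the formatting rule quoted above exercised on both tuple shapes (a standalone sketch, not the _Linker code):

for cc in [(8, 6), (9, 0, "a"), (10, 0, "f")]:
    if len(cc) == 3:
        arch = f"sm_{cc[0]}{cc[1]}{cc[2]}"
    else:
        arch = f"sm_{cc[0]}{cc[1]}"
    print(arch)  # sm_86, then sm_90a, then sm_100f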

@greptile-apps
Contributor

greptile-apps bot commented Jan 12, 2026

Additional Comments (1)

numba_cuda/numba/cuda/codegen.py
This change introduces a potential backward compatibility issue with serialized code libraries. The caches (_ptx_cache, _cubin_cache, _ltoir_cache) use cc tuples as keys. Before this PR, keys were always 2-tuples (major, minor). After this PR, keys can be 3-tuples (major, minor, arch).

This means:

  1. Old serialized caches with 2-tuple keys won't be found by lookups with 3-tuple keys
  2. New serialized caches with 3-tuple keys might not be compatible with older versions of the code

Additionally, there's an inconsistency: _ensure_cc() returns a 3-tuple for CC >= 9.0, but magic_tuple() at line 574 still returns a 2-tuple from device.compute_capability, which could affect cache invalidation logic.

Consider:

  • Normalizing all CC representations to always be 3-tuples throughout the codebase (with empty string for older CCs)
  • Or ensuring cache keys are normalized to a consistent format
  • Adding migration logic for old serialized caches

@gmarkall
Contributor

/ok to test

@greptile-apps (bot) left a comment

6 files reviewed, 6 comments


#include <cuda_fp16.h>

extern "C" __device__
int set_descriptor(int *out, int* smem) {
Contributor:

The C function signature has 2 parameters (int *out, int* smem) but the declaration on line 415 specifies only 1 parameter (types.CPointer(types.int32)). This is a critical mismatch that will cause undefined behavior.

The first parameter int *out is never used in the C code and should be removed. The correct C signature should be:

Suggested change
int set_descriptor(int *out, int* smem) {
int set_descriptor(int* smem) {

This makes the C signature match the Python declaration which expects a single pointer parameter.

Comment on lines +99 to +100
major, minor = found[0], found[1]
cc_arch = found[2] if len(found) == 3 else ""
Contributor:

Unnecessary defensive check: _verify_cc_tuple() always returns a 3-tuple (see lines 60 and 66), so len(found) == 3 will always be True. This check is harmless but redundant.

If you want to keep it for safety, consider simplifying to just found[2] without the conditional, or add a comment explaining why the defensive check exists.


device = devices.get_context().device
return device.compute_capability
cc = device.compute_capability
cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
Contributor:

Hardcoded logic for automatic arch suffix assignment: The code automatically assigns "a" suffix for all cc >= 9.0. However:

  1. Not all GPUs with cc >= 9.0 necessarily have an "a" variant
  2. The logic doesn't handle "f" variants (like sm_100f mentioned in tests)
  3. Users might want cc >= 9.0 without arch suffix

Consider either:

  • Making this configurable via an environment variable/config option
  • Or providing a way to explicitly opt-out of automatic arch suffix
  • Or documenting this behavior clearly so users understand that cc >= 9.0 automatically gets "a" suffix
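
A plain restatement of the automatic-suffix rule under discussion (sketch only, not the library code):

def with_arch_suffix(cc):
    # Append "a" for CC >= 9.0, mirroring the hardcoded rule above.
    return (cc[0], cc[1], "a" if cc >= (9, 0) else "")

assert with_arch_suffix((7, 5)) == (7, 5, "")
assert with_arch_suffix((9, 0)) == (9, 0, "a")
assert with_arch_suffix((12, 0)) == (12, 0, "a")  # every future CC also gets "a"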

@greptile-apps
Contributor

greptile-apps bot commented Jan 12, 2026

Additional Comments (3)

numba_cuda/numba/cuda/codegen.py
Critical inconsistency: _ensure_cc() doesn't normalize user-provided cc tuples. When cc is not None (line 208-209), it returns the tuple as-is without ensuring it has 3 elements. This means:

  1. If a user passes cc=(9, 0) (2-tuple), it's returned unchanged
  2. But when cc=None, it returns a 3-tuple with arch suffix for cc >= 9.0

This creates inconsistent cc tuple lengths throughout the codebase. All code consuming cc tuples needs to handle both 2 and 3-element tuples defensively.

Recommendation: Normalize all cc tuples to 3-element format in _ensure_cc():

def _ensure_cc(self, cc):
    if cc is not None:
        # Normalize to 3-tuple format
        if len(cc) == 2:
            cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
        return cc
    
    device = devices.get_context().device
    cc = device.compute_capability
    cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
    return cc

numba_cuda/numba/cuda/codegen.py
Cache key inconsistency: magic_tuple() returns a 2-tuple (runtime_version, cc) where cc comes directly from device.compute_capability (always a 2-tuple). However, _ensure_cc() now returns 3-tuples for cc >= 9.0.

This means:

  • Cache keys in magic_tuple() use 2-tuple format: (runtime, (9, 0))
  • But compilation caches (_ptx_cache, _cubin_cache, _ltoir_cache) use 3-tuple keys: (9, 0, "a")

This breaks cache lookups and invalidation logic. The fix should apply the same transformation as _ensure_cc():

    def magic_tuple(self):
        """
        Return a tuple unambiguously describing the codegen behaviour.
        """
        ctx = devices.get_context()
        cc = ctx.device.compute_capability
        cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
        return (runtime.runtime.get_version(), cc)

numba_cuda/numba/cuda/codegen.py
Serialization cache format compatibility issue: The caches (_ptx_cache, _cubin_cache, _ltoir_cache) are serialized and restored with their keys intact. With this PR:

  • Old serialized caches: Keys are 2-tuples like (9, 0)
  • New serialized caches: Keys are 3-tuples like (9, 0, "a")

When deserializing old caches with the new code, or vice versa, cache lookups will fail because (9, 0) != (9, 0, "a").

Consider:

  1. Adding a version number to serialized data to detect format changes
  2. Or normalizing cache keys during deserialization in _rebuild()
  3. Or documenting that this is a breaking change requiring cache invalidation

@gmarkall
Contributor

/ok to test

@greptile-apps (bot) left a comment

4 files reviewed, 4 comments


Comment on lines +413 to +417
set_descriptor = cuda.declare_device(
    "set_descriptor",
    types.int32(types.CPointer(types.int32)),
    link=[set_desc],
)
Contributor:

The function signature is incorrect. The set_descriptor function is declared to take 2 parameters (int *out, int* smem), but the C code at line 406 only declares ONE parameter (int* smem). The out parameter is never used in the C implementation.

The signature should be:

Suggested change
set_descriptor = cuda.declare_device(
    "set_descriptor",
    types.int32(types.CPointer(types.int32)),
    link=[set_desc],
)
set_descriptor = cuda.declare_device(
    "set_descriptor",
    types.int32(types.CPointer(types.int32)),
    link=[set_desc],
)

Or if two parameters are needed, the C code should be updated to actually use both parameters.

Comment on lines 349 to +352
cufunc = self._cufunc_cache.get(device.id, None)
if cufunc:
    return cufunc
cubin = self.get_cubin(cc=device.compute_capability)
cubin = self.get_cubin()
Contributor:

This call was changed from self.get_cubin(cc=device.compute_capability) to self.get_cubin(). While this change ensures the arch suffix is automatically added via _ensure_cc(), it could have subtle implications:

  1. Previously, this explicitly passed the device's CC, which was always a 2-tuple
  2. Now it relies on _ensure_cc(None) to query and augment the CC

This change is likely intentional to ensure arch-specific compilation, but it should be noted that this modifies the cache key used for cubin lookups when cc >= 9.0. Old cached cubins compiled before this PR won't be found (cache invalidation).

@greptile-apps
Contributor

greptile-apps bot commented Jan 12, 2026

Additional Comments (2)

numba_cuda/numba/cuda/codegen.py
Critical cache key inconsistency: magic_tuple() returns a 2-tuple (runtime_version, cc) where cc is from device.compute_capability (always a 2-tuple). However, _ensure_cc() now returns a 3-tuple for cc >= 9.0. This creates inconsistent cache keys used by _index_key() in the caching system (see numba_cuda/numba/cuda/core/caching.py line 345).

When cc >= 9.0:

  • Cache lookups use magic_tuple() with 2-tuple: (runtime_version, (9, 0))
  • Cache stores use _ensure_cc() with 3-tuple: (runtime_version, (9, 0, "a"))

This means cached compilations will never be found, causing unnecessary recompilation every time.

The fix is to apply the same transformation in magic_tuple():

    def magic_tuple(self):
        """
        Return a tuple unambiguously describing the codegen behaviour.
        """
        ctx = devices.get_context()
        cc = ctx.device.compute_capability
        cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
        return (runtime.runtime.get_version(), cc)

numba_cuda/numba/cuda/codegen.py
Inconsistent normalization: when cc is provided by the user (not None), it's returned as-is without normalization. But when cc=None, it returns a 3-tuple with arch suffix for cc >= 9.0. This means:

  1. User passes cc=(9, 0) → returns (9, 0) (2-tuple)
  2. User passes cc=None → returns (9, 0, "a") (3-tuple)

This creates inconsistent tuple lengths throughout the codebase. All downstream code must defensively handle both 2 and 3-element tuples.

Additionally, user-provided 3-element tuples with invalid arch suffixes are not validated here (validation only happens in nvrtc._verify_cc_tuple() later).

Recommendation: Normalize ALL cc tuples to 3-element format here:

    def _ensure_cc(self, cc):
        if cc is not None:
            # Normalize user-provided cc to 3-tuple format
            if len(cc) == 2:
                cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
            elif len(cc) == 3:
                # Validate arch suffix
                if cc[2] not in ("", "a", "f"):
                    raise ValueError(
                        f"Invalid architecture suffix '{cc[2]}' in compute "
                        f"capability {cc[0]}.{cc[1]}{cc[2]}. Expected '', 'a', or 'f'."
                    )
            else:
                raise ValueError(f"Invalid cc tuple length: {len(cc)}")
            return cc

        device = devices.get_context().device
        cc = device.compute_capability
        cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
        return cc

@gmarkall
Contributor

/ok to test

@gmarkall
Contributor

/ok to test

@greptile-apps (bot) left a comment

5 files reviewed, 5 comments


Comment on lines +59 to +63
if found != cc:
    found = (found[0], found[1], arch)
    warnings.warn(
        f"Device compute capability {ver_str(cc)} is not supported by "
        f"NVRTC {ver_str(version)}. Using {ver_str(found)} instead."
Contributor:

Misleading warning message. When found != cc (line 59), it means NVRTC doesn't support the exact CC version, so it falls back to a lower supported version. However, the warning message on lines 62-63 says "Device compute capability ... is not supported" which is confusing because:

  1. It's actually the NVRTC toolkit that doesn't support it, not the device
  2. The warning talks about "Using ... instead" but the arch suffix is preserved from the input, which might not make sense for a downgraded CC

For example, if a user requests CC 9.5 with "a" suffix on a toolkit that only supports up to 9.0, the code will return (9, 0, "a"). But the warning message is misleading about what's happening.

Suggestion: Make the warning more accurate:

warnings.warn(
    f"Compute capability {ver_str(cc)}{arch} is not supported by "
    f"NVRTC {ver_str(version)}. Downgrading to {ver_str(found)} "
    f"(closest supported version) with arch suffix '{arch}'."
)

Comment on lines +37 to +40
if len(cc) == 3:
    cc, arch = (cc[0], cc[1]), cc[2]
else:
    arch = ""
Contributor:

Variable shadowing issue: Line 38 reassigns cc from a 3-tuple to a 2-tuple (cc[0], cc[1]), extracting the arch suffix separately. This shadows the input parameter cc and makes the code harder to follow.

Additionally, this creates a potential bug on line 50 where cc (now the 2-tuple) is used in the filter comparison, but the original 3-tuple meaning is lost. While this works correctly because we're only comparing major/minor versions, the code would be clearer with distinct variable names.

Suggestion: Use distinct variable names:

if len(cc) == 3:
    cc_base = (cc[0], cc[1])
    arch = cc[2]
else:
    cc_base = cc
    arch = ""

# Then use cc_base throughout for comparisons
found = max(filter(lambda v: v <= cc_base, [v for v in supported_ccs]))

Comment on lines +213 to +235
def get_arch_option(major, minor, arch=""):
"""Matches with the closest architecture option"""
if config.FORCE_CUDA_CC:
arch = config.FORCE_CUDA_CC
fcc = config.FORCE_CUDA_CC
major, minor = fcc[0], fcc[1]
if len(fcc) == 3:
arch = fcc[2]
else:
arch = ""
else:
arch = find_closest_arch((major, minor))
return "compute_%d%d" % arch
new_major, new_minor = find_closest_arch((major, minor))
if (new_major, new_minor) != (major, minor):
# If we picked a different major / minor, then using an
# arch-specific version is invalid
if arch != "":
raise ValueError(
f"Can't use arch-specific compute_{major}{minor}{arch} with "
"closest found compute capability "
f"compute_{new_major}{new_minor}"
)
major, minor = new_major, new_minor

return f"compute_{major}{minor}{arch}"
Contributor:

Logic inconsistency: When config.FORCE_CUDA_CC is set (lines 216-221), the function extracts the arch suffix from fcc and assigns it to the arch parameter. However, this overwrites the arch parameter that was passed into the function, which could be unexpected behavior.

Example problematic scenario:

# User calls: get_arch_option(9, 0, "f")  
# With FORCE_CUDA_CC = (9, 0, "a")
# Result: returns "compute_90a" (ignoring the "f" that was passed in)

The function signature suggests arch="" is a parameter the caller can control, but it gets overwritten when FORCE_CUDA_CC is set.

Recommendation: Either:

  1. Document that arch parameter is ignored when FORCE_CUDA_CC is set, OR
  2. Respect the passed-in arch parameter and only use the major/minor from FORCE_CUDA_CC

Based on the test at line 443-446 which expects the FORCE_CUDA_CC arch to be used, option 1 seems intended, so add documentation.

@greptile-apps
Contributor

greptile-apps bot commented Jan 12, 2026

Additional Comments (2)

numba_cuda/numba/cuda/memory_management/nrt.py
For devices with CC >= 9.0, this code path will pass a 2-tuple to _Linker, which means it won't automatically get the "a" suffix that is added in _ensure_cc() in codegen.py. This creates an inconsistency where:

  • Code compiled through CUDACodeLibrary (e.g., JIT kernels) gets sm_90a for CC 9.0 devices
  • NRT memsys.cu compiled here gets sm_90 without the suffix

This means NRT won't benefit from arch-specific features on CC >= 9.0 devices, and could potentially cause compatibility issues when linking together.

Fix: Apply the same transformation as in _ensure_cc():

cc = get_current_device().compute_capability
cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")

numba_cuda/numba/cuda/codegen.py
Critical normalization bug: When cc is not None (line 208-209), this function returns the user-provided tuple as-is without normalizing it to 3 elements. This creates inconsistent behavior:

Scenario 1: User calls with cc=(9, 0) (2-tuple)

  • Returns (9, 0) → 2-tuple

Scenario 2: User calls with cc=None on CC 9.0 device

  • Returns (9, 0, "a") → 3-tuple

This inconsistency causes multiple problems:

  1. Cache keys have different tuple lengths for the same logical CC
  2. Code that uses *cc to unpack will fail with 2-tuples when arch suffix is expected
  3. Comparison operations cc >= (9, 0) behave differently

Fix: Normalize ALL cc tuples consistently:

def _ensure_cc(self, cc):
    if cc is not None:
        # Normalize user-provided tuple to 3 elements
        if len(cc) == 2:
            cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
        return cc
    
    device = devices.get_context().device
    cc = device.compute_capability
    cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
    return cc

@gmarkall gmarkall merged commit a1c8294 into NVIDIA:main Jan 12, 2026
177 of 178 checks passed
gmarkall added a commit to gmarkall/numba-cuda that referenced this pull request Jan 12, 2026
- Add arch specific target support (NVIDIA#549)
- chore: disable `locked` flag to bypass prefix-dev/pixi#5256 (NVIDIA#714)
- ci: relock pixi (NVIDIA#712)
- ci: remove redundant conda build in ci (NVIDIA#711)
- chore(deps): bump numba-cuda version and relock pixi (NVIDIA#707)
- Dropping bits in the old CI & Propagating recent changes from cuda-python (NVIDIA#683)
- Fix `test_wheel_deps_wheels.sh` to actually uninstall `nvvm` and `nvrtc` packages for CUDA 13 (NVIDIA#701)
- perf: remove some exception control flow and buffer-exception penalization for arrays (NVIDIA#700)
- perf: let CAI fall through instead of calling from_cuda_array_interface (NVIDIA#694)
- chore: perf lint (NVIDIA#697)
- chore(deps): bump deps in pixi lockfile (NVIDIA#693)
- fix: use freethreading-supported `_PySet_NextItemRef` where possible (NVIDIA#682)
- Support python `3.14` (NVIDIA#599)
- Remove customized address space tracking and address class emission in debug info (NVIDIA#669)
- Drop `experimental` from cuda.core namespace imports (NVIDIA#676)
- Remove dangling references to NUMBA_CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY (NVIDIA#675)
- Use `rapidsai/sccache` in CI (NVIDIA#674)
- chore(dev-deps): remove ipython and pyinstrument (NVIDIA#670)
- Set up a new VM-based CI infrastructure  (NVIDIA#604)
@gmarkall gmarkall mentioned this pull request Jan 12, 2026
gmarkall added a commit that referenced this pull request Jan 12, 2026
- Add arch specific target support (#549)
- chore: disable `locked` flag to bypass prefix-dev/pixi#5256 (#714)
- ci: relock pixi (#712)
- ci: remove redundant conda build in ci (#711)
- chore(deps): bump numba-cuda version and relock pixi (#707)
- Dropping bits in the old CI & Propagating recent changes from cuda-python (#683)
- Fix `test_wheel_deps_wheels.sh` to actually uninstall `nvvm` and `nvrtc` packages for CUDA 13 (#701)
- perf: remove some exception control flow and buffer-exception penalization for arrays (#700)
- perf: let CAI fall through instead of calling from_cuda_array_interface (#694)
- chore: perf lint (#697)
- chore(deps): bump deps in pixi lockfile (#693)
- fix: use freethreading-supported `_PySet_NextItemRef` where possible (#682)
- Support python `3.14` (#599)
- Remove customized address space tracking and address class emission in debug info (#669)
- Drop `experimental` from cuda.core namespace imports (#676)
- Remove dangling references to NUMBA_CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY (#675)
- Use `rapidsai/sccache` in CI (#674)
- chore(dev-deps): remove ipython and pyinstrument (#670)
- Set up a new VM-based CI infrastructure (#604)

Labels

2 - In Progress (currently a work in progress), P0 (blocking external stakeholders)


Development

Successfully merging this pull request may close these issues.

[FEA] Add support to architecture/family specific features

3 participants