Add arch specific target support #549
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
v0.23.0

- Capture global device arrays in kernels and device functions (NVIDIA#666)
- Fix NVIDIA#624: Accept Numba IR nodes in all places Numba-CUDA IR nodes are expected (NVIDIA#643)
- Fix Issue NVIDIA#588: separate compilation of NVVM IR modules when generating debuginfo (NVIDIA#591)
- feat: allow printing nested tuples (NVIDIA#667)
- build(deps): bump actions/setup-python from 5.6.0 to 6.1.0 (NVIDIA#655)
- build(deps): bump actions/upload-artifact from 4 to 5 (NVIDIA#652)
- Test RAPIDS 25.12 (NVIDIA#661)
- Do not manually set DUMP_ASSEMBLY in `nvjitlink` tests (NVIDIA#662)
- feat: add print support for int64 tuples (NVIDIA#663)
- Only run dependabot monthly and open fewer PRs (NVIDIA#658)
- test: fix bogus `self` argument to `Context` (NVIDIA#656)
- Fix false negative NRT link decision when NRT was previously toggled on (NVIDIA#650)
- Add support for dependabot (NVIDIA#647)
- refactor: cull dead linker objects (NVIDIA#649)
- Migrate numba-cuda driver to use cuda.core.launch API (NVIDIA#609)
- feat: add set_shared_memory_carveout (NVIDIA#629)
- chore: bump version in pixi.toml (NVIDIA#641)
- refactor: remove devicearray code to reduce complexity (NVIDIA#600)
Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.
/ok to test

/ok to test
Greptile Overview
Greptile Summary
This PR adds support for architecture-specific CUDA compute capability targets (e.g., sm_90a) to enable Hopper-specific features like tensor memory access instructions. The changes extend the compute capability tuple from 2 elements (major, minor) to optionally 3 elements (major, minor, arch_suffix) where the suffix can be "a" (accelerated) or "f" (full).
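As a hedged illustration of that tuple extension (values illustrative, not taken from the diff):

```python
cc = (9, 0)        # classic 2-tuple: major, minor         -> "sm_90"
cc = (9, 0, "a")   # 3-tuple with "a" (accelerated) suffix -> "sm_90a"
cc = (10, 0, "f")  # 3-tuple with "f" (full) suffix        -> "sm_100f"
```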
Key changes:
- Extended `_ensure_cc` in `codegen.py` to automatically add the "a" suffix for CC >= 9.0
- Added a `_verify_cc_tuple` validation function in `nvrtc.py` to handle and validate arch suffixes
- Updated `get_arch_option` to accept and propagate the arch suffix parameter
- Modified `_Linker` to support 3-element CC tuples for proper target generation
Issues found:
- Critical: The test function signature in `test_cuda_driver.py` has a parameter mismatch - the C function declares 2 parameters but `declare_device` only specifies 1, which will cause runtime errors
- Minor: Outdated docstring parameter name in `find_closest_arch`
Confidence Score: 2/5
- This PR has a critical test bug that will cause runtime failures
- The core implementation for arch-specific targets appears sound, but the test has a critical signature mismatch that will cause failures. The C function expects 2 parameters but is declared with only 1, making the test non-functional.
- Pay close attention to `numba_cuda/numba/cuda/tests/cudadrv/test_cuda_driver.py` - the function signature mismatch must be fixed before merging
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| numba_cuda/numba/cuda/codegen.py | 4/5 | Modified _ensure_cc to add arch suffix ('a') for CC >= 9.0 and removed explicit CC parameter from get_cubin call |
| numba_cuda/numba/cuda/cudadrv/driver.py | 5/5 | Added support for 3-element CC tuples in _Linker.__init__ to handle arch-specific targets |
| numba_cuda/numba/cuda/cudadrv/nvrtc.py | 3/5 | Added _verify_cc_tuple function and updated get_arch_option to support arch suffixes; docstring parameter name is outdated |
| numba_cuda/numba/cuda/testing.py | 5/5 | Added skip_unless_cc_90 test decorator for CC 9.0+ tests |
| numba_cuda/numba/cuda/tests/cudadrv/test_cuda_driver.py | 2/5 | Added test for arch-specific features; signature mismatch between C function (2 params) and declare_device (1 param) will cause runtime error |
| """) | ||
|
|
||
| set_descriptor = cuda.declare_device( | ||
| "set_descriptor", |
The `types.int32(types.CPointer(types.int32))` signature doesn't match the C function, which takes two parameters (`int *out, int* smem`), but `declare_device` is only passed one parameter type.
This is not an error, the Numba calling convention specifies that the return value is stored into the value pointed to by the first argument.
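For reference, a minimal sketch of the convention described in this reply; the function name `add_one` and the use of `cuda.CUSource` here are illustrative assumptions, not code from the PR. A declaration of `types.int32(types.CPointer(types.int32))` maps to a C function that takes a leading pointer through which the Python-level return value is written, while the C `int` return value is a status code.

```python
from numba import cuda, types

# Sketch only: under the Numba device calling convention, the Python-level
# signature int32(CPointer(int32)) corresponds to a C function with an extra
# leading pointer parameter for the return value, plus an int status code.
src = cuda.CUSource("""
extern "C" __device__
int add_one(int *out, int *arg) {
    *out = *arg + 1;   // Python-level return value goes through the first pointer
    return 0;          // status code: 0 means no exception
}
""")

add_one = cuda.declare_device(
    "add_one",
    types.int32(types.CPointer(types.int32)),
    link=[src],
)
```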
/ok to test
Greptile Overview
Greptile Summary
Adds architecture-specific target support (sm_90a, sm_90f) by extending compute capability tuples from 2-element to optional 3-element format. Modifies compilation pipeline to automatically append "a" suffix for CC 9.0+ devices and validates architecture suffixes in NVRTC compilation. Updates linker to handle the new tuple format for generating architecture strings.
Confidence Score: 1/5
- Contains critical backward compatibility issues and runtime errors
- Two critical issues block merge: (1) codegen.py modifies device.compute_capability from 2-tuple to 3-tuple, breaking comparisons throughout the codebase where 2-tuples are expected (testing.py:282, driver.py:591, driver.py:622, and multiple test files); (2) nvrtc.py:202 has tuple concatenation bug causing immediate runtime crash. Additionally, there are concerns about cache key consistency and arch suffix handling with FORCE_CUDA_CC.
- numba_cuda/numba/cuda/codegen.py and numba_cuda/numba/cuda/cudadrv/nvrtc.py require immediate fixes before merge
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| numba_cuda/numba/cuda/codegen.py | 1/5 | Modifies compute_capability from 2-tuple to 3-tuple; breaks existing comparisons throughout codebase |
| numba_cuda/numba/cuda/cudadrv/nvrtc.py | 1/5 | Contains tuple concatenation bug that will cause runtime error; arch suffix validation added correctly |
| numba_cuda/numba/cuda/cudadrv/driver.py | 3/5 | Updates linker to handle optional 3rd arch element in cc tuple; implementation looks correct |
```python
        cc = device.compute_capability
        cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
        return cc
```
This change breaks backward compatibility by modifying compute_capability from a 2-tuple to a 3-tuple. Throughout the codebase, device.compute_capability is expected to be a 2-tuple (major, minor) and is compared with 2-tuples:
- `testing.py:282`: `cc >= (major, minor)` where `cc = device.compute_capability`
- `driver.py:591`: `self.compute_capability < MIN_REQUIRED_CC`
- `driver.py:622`: `self.compute_capability >= (5, 3)`
- Multiple test files perform similar comparisons
After line 213 executes, `cc` becomes a 3-tuple, which will break tuple comparisons. For example, `(9, 0, "a") >= (9, 0)` evaluates to `True` in Python, but `(8, 0, "") >= (9, 0)` would incorrectly evaluate to `False` due to string comparison.
The modification should only happen locally within this method and not affect the device's compute_capability property. Consider storing the arch-specific version separately or ensure all comparison sites are updated to handle 3-tuples.
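For reference, Python orders tuples element-wise from the left, and a shorter tuple that is a prefix of a longer one compares as smaller; that rule is what actually governs the mixed 2-/3-tuple comparisons discussed here:

```python
# Element-wise tuple comparison, left to right:
assert (9, 0, "a") >= (9, 0)          # True: equal prefix, left tuple is longer
assert not ((8, 0, "") >= (9, 0))     # False because 8 < 9, not due to string comparison
assert (9, 0) < (9, 0, "a")           # a 2-tuple prefix compares smaller than the 3-tuple

# A TypeError only occurs if a str and an int meet at the same position,
# which cannot happen when the third element is always a string:
assert (9, 0, "") < (9, 0, "a")       # "" < "a" by string comparison
```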
The device's compute capability is not being affected.
```diff
         msg = (
             "GPU compute capability %d.%d is not supported"
-            "(requires >=%d.%d)" % (mycc + cc)
+            "(requires >=%d.%d)" % (cc + supported_cc)
         )
```
Tuple concatenation syntax error will cause runtime crash. The expression `cc + supported_cc` attempts to concatenate two tuples for use in string formatting with `%`, but the format string expects 4 separate values, not a 4-tuple.
This should be:
```diff
 msg = (
     "GPU compute capability %d.%d is not supported"
-    "(requires >=%d.%d)" % (cc + supported_cc)
+    "(requires >=%d.%d)" % (cc[0], cc[1], supported_cc[0], supported_cc[1])
 )
```
Note: This bug exists in both the original and modified code at this location. The refactoring changed the parameter name from `mycc` to `cc` but didn't fix the pre-existing bug.
I think this is correct, but fixing it is low-priority and should be done outside this PR.
Upon re-reading the comment, I think it is bogus.
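A quick runnable check of the disputed line (values hypothetical): adjacent string literals are concatenated at parse time, and `cc + supported_cc` produces a 4-tuple that matches the four `%d` specifiers, so the formatting is valid.

```python
cc = (9, 0)
supported_cc = (12, 0)
msg = (
    "GPU compute capability %d.%d is not supported"
    "(requires >=%d.%d)" % (cc + supported_cc)
)
# cc + supported_cc == (9, 0, 12, 0): four values for four %d specifiers.
assert msg == "GPU compute capability 9.0 is not supported(requires >=12.0)"
```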
```diff
         if cufunc:
             return cufunc
-        cubin = self.get_cubin(cc=device.compute_capability)
+        cubin = self.get_cubin()
```
The removal of the explicit `cc=device.compute_capability` parameter may cause issues. The `get_cubin()` method will call `_ensure_cc(None)`, which will fetch `device.compute_capability` and convert it to a 3-tuple with the arch suffix.
However, if the device's compute capability is cached elsewhere or accessed later, there's an inconsistency: the device object still reports a 2-tuple, but compilation uses a 3-tuple. Verify that cached cubins are properly keyed by the 3-tuple to avoid cache misses or incorrect binary selection.
The `cc` argument was removed here because it was the same as the default that got used in `_ensure_cc` anyway. So this removal saved making the same change in two different places.
```diff
     if config.FORCE_CUDA_CC:
-        arch = config.FORCE_CUDA_CC
+        major, minor = config.FORCE_CUDA_CC
     else:
-        arch = find_closest_arch((major, minor))
-    return "compute_%d%d" % arch
+        major, minor = find_closest_arch((major, minor))
```
When `config.FORCE_CUDA_CC` is set, the `arch` parameter is discarded. If a user explicitly passes a 3-tuple with an arch suffix (e.g., `(9, 0, "a")`), the arch suffix will be lost when `FORCE_CUDA_CC` overrides the major/minor values.
Consider preserving the arch suffix:
```python
if config.FORCE_CUDA_CC:
    major, minor = config.FORCE_CUDA_CC
    # arch parameter is preserved from the function argument
else:
    major, minor = find_closest_arch((major, minor))
```
This ensures that if someone calls `get_arch_option(9, 0, "a")` with `FORCE_CUDA_CC=(9, 0)`, they still get `"compute_90a"` instead of `"compute_90"`. Should the arch suffix be preserved when `FORCE_CUDA_CC` is set, or is it intentional to reset it?
I think arch should be discarded when FORCE_CUDA_CC is set but it isn't here. This should be fixed, but not in the way suggested by greptile.
```python
        set_descriptor = cuda.declare_device(
            "set_descriptor",
            types.int32(types.CPointer(types.int32)),
            link=[set_desc],
```
The `set_descriptor` function has an unused `out` parameter and doesn't actually set any descriptor; it only executes a `tensormap.replace` instruction. The function signature and implementation don't match the apparent intent.
If this is intentional for testing that the architecture-specific instruction compiles and runs, consider renaming the function to reflect what it actually does, or remove the unused parameter.
It's intentional for testing.
Also, this is updating a TMA descriptor, so the name seems appropriate.
When `config.FORCE_CUDA_CC` is set, the arch option from it should be used regardless of what is passed into `get_arch_option()`. Also, if the discovered CC is not an exact match for the requested one, it is not valid to use arch-specific versions because the arch-specific PTX will not be compatible with the device being targeted.
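To make that intended behavior concrete, here is a self-contained sketch; the helper name and its `closest`/`forced` parameters are hypothetical stand-ins for `find_closest_arch()` and `config.FORCE_CUDA_CC`, not the real API:

```python
def arch_option_sketch(major, minor, arch, closest, forced=None):
    """Sketch of the intended get_arch_option() semantics (not the real API).

    `forced` models config.FORCE_CUDA_CC: when set, it wins entirely,
    including its suffix (or lack of one). `closest` models the result of
    find_closest_arch((major, minor)): if it differs from the request, an
    arch-specific suffix is invalid, because e.g. sm_90a PTX will not run
    on the lower-CC device actually targeted.
    """
    if forced is not None:
        major, minor = forced[0], forced[1]
        arch = forced[2] if len(forced) == 3 else ""
    elif closest != (major, minor):
        if arch:
            raise ValueError(
                f"Can't use arch-specific compute_{major}{minor}{arch} with "
                f"closest found compute capability compute_{closest[0]}{closest[1]}"
            )
        major, minor = closest
    return f"compute_{major}{minor}{arch}"

assert arch_option_sketch(9, 0, "a", closest=(9, 0)) == "compute_90a"
assert arch_option_sketch(9, 0, "a", closest=(9, 0), forced=(8, 0)) == "compute_80"
```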
/ok to test
Greptile Overview

Greptile Summary

This PR adds support for architecture-specific CUDA compute capability targets (e.g., `sm_90a`).

Implementation Overview

Core Changes:

Critical Blocking Issues

1. Cache Key Inconsistency (CRITICAL - Already Noted)
2. `magic_tuple()` Mismatch (CRITICAL - Already Noted)
3. NRT Module Inconsistency (NEW CRITICAL ISSUE): the NRT module's compute capability does not get the same arch-suffix transformation (detailed in a later comment).
4. Backward Compatibility Risk: serialized code libraries (pickle) use cc tuples as cache keys. Pre-PR caches have 2-tuple keys, post-PR can have 3-tuple keys. This breaks cache compatibility across versions for CC >= 9.0 devices.

Required Fixes

The PR's goal is sound and tests are well-designed, but the implementation has critical bugs that will cause compilation cache misses and inconsistent behavior across the codebase.

Confidence Score: 1/5

Important Files Changed

File Analysis
```python
        set_descriptor = cuda.declare_device(
            "set_descriptor",
            types.int32(types.CPointer(types.int32)),
            link=[set_desc],
```
[P0] Function signature mismatch causes runtime error. The C function `set_descriptor` takes 2 parameters (`int *out, int* smem`), but the declared signature only specifies 1 parameter. Change to `types.int32(types.CPointer(types.int32), types.CPointer(types.int32))`.
```python
        def kernel(a):
            sm = cuda.shared.array(1, dtype=np.int32)
            data_ptr = ffi.from_buffer(sm)
            set_descriptor(data_ptr)
```
[P0] Function call missing required parameter. `set_descriptor` expects 2 arguments but only receives 1. Should be `set_descriptor(data_ptr, data_ptr)` or similar based on the intended semantics.
Greptile Overview

Greptile Summary

Overview

This PR adds support for architecture-specific CUDA compute capability targets (e.g., sm_90a, sm_100f) to enable use of arch-specific features in CUDA kernels. The implementation extends compute capability tuples from 2-element to an optional 3-element form.

Key Changes

Critical Issues Found

Recommendations

Confidence Score: 3/5

Important Files Changed

File Analysis
| extern "C" __device__ | ||
| int set_descriptor(int *out, int* smem) { | ||
| unsigned usmem = __cvta_generic_to_shared(smem); | ||
| asm volatile("tensormap.replace.tile.rank.shared::cta.b1024.b32 [%0], 2;" :: "r"(usmem)); | ||
| return 0; | ||
| } | ||
| """) | ||
|
|
||
| set_descriptor = cuda.declare_device( | ||
| "set_descriptor", | ||
| types.int32(types.CPointer(types.int32)), | ||
| link=[set_desc], |
The C function `set_descriptor` declares two parameters (`int *out, int* smem`) but the Numba declaration only specifies one parameter `types.CPointer(types.int32)`, and only one argument is passed at line 425. This is a parameter count mismatch.
Looking at the C implementation, the `out` parameter is never used - only `smem` is used. The function signature should be corrected to match the actual usage:
extern "C" __device__
int set_descriptor(int* smem) {
// ... rest of implementation
}Or if out is intended to be used, the Numba declaration and call site need to be updated to pass both parameters.
This is because of the Numba ABI.
```diff
         # being optimized away.
-        major, minor = found
+        major, minor = found[0], found[1]
+        cc_arch = found[2] if len(found) == 3 else ""
```
The check `if len(found) == 3` is redundant because `_verify_cc_tuple` always returns a 3-tuple (see lines 60 and 66 in the same file). Consider simplifying to:
```diff
-cc_arch = found[2] if len(found) == 3 else ""
+cc_arch = found[2]
```
This makes the code clearer and avoids the unnecessary conditional.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
```python
def _verify_cc_tuple(cc):
    version = _get_nvrtc_version()
    ver_str = lambda version: ".".join(str(v) for v in version)

    if len(cc) == 3:
        cc, arch = (cc[0], cc[1]), cc[2]
    else:
        arch = ""

    if arch not in ("", "a", "f"):
        raise ValueError(
            f"Invalid architecture suffix '{arch}' in compute capability "
            f"{ver_str(cc)}{arch}. Expected '', 'a', or 'f'."
        )
```
There's no validation that `cc` is a tuple with length 2 or 3. If an invalid input like `cc=(9,)` (1-tuple) or `cc=(9, 0, "a", "extra")` (4-tuple) is passed, the function will fail with unclear error messages.
Consider adding validation at the start of the function:
```python
if not isinstance(cc, tuple) or len(cc) not in (2, 3):
    raise ValueError(
        f"Compute capability must be a 2-tuple (major, minor) or "
        f"3-tuple (major, minor, arch), got {cc}"
    )
```

```python
        if len(cc) == 3:
            arch = f"sm_{cc[0]}{cc[1]}{cc[2]}"
        else:
            arch = f"sm_{cc[0]}{cc[1]}"
```
Similar to the `nvrtc.py` issue, there's no validation that `cc` is a properly formed tuple. Consider adding a check:
```python
if not isinstance(cc, tuple) or len(cc) not in (2, 3):
    raise ValueError(f"cc must be a 2-tuple or 3-tuple, got {cc}")
```

This will provide clearer error messages if invalid input is passed.
/ok to test
```c
        #include <cuda_fp16.h>

        extern "C" __device__
        int set_descriptor(int *out, int* smem) {
```
The C function signature has 2 parameters (`int *out, int* smem`) but the declaration on line 415 specifies only 1 parameter (`types.CPointer(types.int32)`). This is a critical mismatch that will cause undefined behavior.
The first parameter `int *out` is never used in the C code and should be removed. The correct C signature should be:
```diff
-int set_descriptor(int *out, int* smem) {
+int set_descriptor(int* smem) {
```
This makes the C signature match the Python declaration which expects a single pointer parameter.
```python
        major, minor = found[0], found[1]
        cc_arch = found[2] if len(found) == 3 else ""
```
Unnecessary defensive check: `_verify_cc_tuple()` always returns a 3-tuple (see lines 60 and 66), so `len(found) == 3` will always be `True`. This check is harmless but redundant.
If you want to keep it for safety, consider simplifying to just `found[2]` without the conditional, or add a comment explaining why the defensive check exists.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
```diff
         device = devices.get_context().device
-        return device.compute_capability
+        cc = device.compute_capability
+        cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
```
Hardcoded logic for automatic arch suffix assignment: the code automatically assigns the "a" suffix for all `cc >= 9.0`. However:
- Not all GPUs with cc >= 9.0 necessarily have an "a" variant
- The logic doesn't handle "f" variants (like sm_100f mentioned in tests)
- Users might want cc >= 9.0 without arch suffix
Consider either:
- Making this configurable via an environment variable/config option
- Or providing a way to explicitly opt-out of automatic arch suffix
- Or documenting this behavior clearly so users understand that cc >= 9.0 automatically gets "a" suffix
Additional Comments (3)
This creates inconsistent cc tuple lengths throughout the codebase. All code consuming cc tuples needs to handle both 2 and 3-element tuples defensively. Recommendation: normalize all cc tuples to 3-element format in `_ensure_cc`:

```python
def _ensure_cc(self, cc):
    if cc is not None:
        # Normalize to 3-tuple format
        if len(cc) == 2:
            cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
        return cc
    device = devices.get_context().device
    cc = device.compute_capability
    cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
    return cc
```
This breaks cache lookups and invalidation logic; the fix should apply the same transformation as `_ensure_cc`. When deserializing old caches with the new code, or vice versa, cache lookups will fail because the 2-tuple and 3-tuple keys no longer match.
/ok to test
```python
        set_descriptor = cuda.declare_device(
            "set_descriptor",
            types.int32(types.CPointer(types.int32)),
            link=[set_desc],
        )
```
The function signature is incorrect. The `set_descriptor` function is declared to take 2 parameters (`int *out, int* smem`), but the C code at line 406 only declares ONE parameter (`int* smem`). The `out` parameter is never used in the C implementation.
The signature should be:
```python
set_descriptor = cuda.declare_device(
    "set_descriptor",
    types.int32(types.CPointer(types.int32)),
    link=[set_desc],
)
```
Or if two parameters are needed, the C code should be updated to actually use both parameters.
```diff
         cufunc = self._cufunc_cache.get(device.id, None)
         if cufunc:
             return cufunc
-        cubin = self.get_cubin(cc=device.compute_capability)
+        cubin = self.get_cubin()
```
This call was changed from `self.get_cubin(cc=device.compute_capability)` to `self.get_cubin()`. While this change ensures the arch suffix is automatically added via `_ensure_cc()`, it could have subtle implications:
- Previously, this explicitly passed the device's CC, which was always a 2-tuple
- Now it relies on `_ensure_cc(None)` to query and augment the CC
This change is likely intentional to ensure arch-specific compilation, but it should be noted that this modifies the cache key used for cubin lookups when cc >= 9.0. Old cached cubins compiled before this PR won't be found (cache invalidation).
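A small, hypothetical illustration of the cache-key concern, with a plain dict standing in for the cubin cache:

```python
# Hypothetical stand-in for a cubin cache keyed by cc tuples:
cache = {(9, 0): "cubin compiled pre-PR"}   # old entry stored under a 2-tuple key
assert cache.get((9, 0, "a")) is None       # post-PR lookup with a 3-tuple key misses
```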
Additional Comments (2)
When cc >= 9.0, cached compilations will never be found, causing unnecessary recompilation every time. The fix is to apply the same transformation in `_ensure_cc`.

This creates inconsistent tuple lengths throughout the codebase. All downstream code must defensively handle both 2 and 3-element tuples. Additionally, user-provided 3-element tuples with invalid arch suffixes are not validated here (validation only happens in `_verify_cc_tuple`). Recommendation: normalize ALL cc tuples to 3-element format here.
/ok to test

/ok to test
```python
        if found != cc:
            found = (found[0], found[1], arch)
            warnings.warn(
                f"Device compute capability {ver_str(cc)} is not supported by "
                f"NVRTC {ver_str(version)}. Using {ver_str(found)} instead."
```
Misleading warning message. When `found != cc` (line 59), it means NVRTC doesn't support the exact CC version, so it falls back to a lower supported version. However, the warning message on lines 62-63 says "Device compute capability ... is not supported" which is confusing because:
- It's actually the NVRTC toolkit that doesn't support it, not the device
- The warning talks about "Using ... instead" but the arch suffix is preserved from the input, which might not make sense for a downgraded CC
For example, if a user requests CC 9.5 with "a" suffix on a toolkit that only supports up to 9.0, the code will return (9, 0, "a"). But the warning message is misleading about what's happening.
Suggestion: Make the warning more accurate:
```python
warnings.warn(
    f"Compute capability {ver_str(cc)}{arch} is not supported by "
    f"NVRTC {ver_str(version)}. Downgrading to {ver_str(found)} "
    f"(closest supported version) with arch suffix '{arch}'."
)
```

```python
    if len(cc) == 3:
        cc, arch = (cc[0], cc[1]), cc[2]
    else:
        arch = ""
```
Variable shadowing issue: line 38 reassigns `cc` from a 3-tuple to a 2-tuple `(cc[0], cc[1])`, extracting the arch suffix separately. This shadows the input parameter `cc` and makes the code harder to follow.
Additionally, this creates a potential bug on line 50 where `cc` (now the 2-tuple) is used in the filter comparison, but the original 3-tuple meaning is lost. While this works correctly because we're only comparing major/minor versions, the code would be clearer with distinct variable names.
Suggestion: Use distinct variable names:
```python
if len(cc) == 3:
    cc_base = (cc[0], cc[1])
    arch = cc[2]
else:
    cc_base = cc
    arch = ""
# Then use cc_base throughout for comparisons
found = max(filter(lambda v: v <= cc_base, [v for v in supported_ccs]))
```

```diff
 def get_arch_option(major, minor, arch=""):
     """Matches with the closest architecture option"""
     if config.FORCE_CUDA_CC:
-        arch = config.FORCE_CUDA_CC
+        fcc = config.FORCE_CUDA_CC
+        major, minor = fcc[0], fcc[1]
+        if len(fcc) == 3:
+            arch = fcc[2]
+        else:
+            arch = ""
     else:
-        arch = find_closest_arch((major, minor))
-    return "compute_%d%d" % arch
+        new_major, new_minor = find_closest_arch((major, minor))
+        if (new_major, new_minor) != (major, minor):
+            # If we picked a different major / minor, then using an
+            # arch-specific version is invalid
+            if arch != "":
+                raise ValueError(
+                    f"Can't use arch-specific compute_{major}{minor}{arch} with "
+                    "closest found compute capability "
+                    f"compute_{new_major}{new_minor}"
+                )
+            major, minor = new_major, new_minor
+
+    return f"compute_{major}{minor}{arch}"
```
Logic inconsistency: when `config.FORCE_CUDA_CC` is set (lines 216-221), the function extracts the arch suffix from `fcc` and assigns it to the `arch` parameter. However, this overwrites the `arch` parameter that was passed into the function, which could be unexpected behavior.
Example problematic scenario:
```python
# User calls: get_arch_option(9, 0, "f")
# With FORCE_CUDA_CC = (9, 0, "a")
# Result: returns "compute_90a" (ignoring the "f" that was passed in)
```

The function signature suggests `arch=""` is a parameter the caller can control, but it gets overwritten when `FORCE_CUDA_CC` is set.
Recommendation: Either:

- Document that the `arch` parameter is ignored when `FORCE_CUDA_CC` is set, OR
- Respect the passed-in `arch` parameter and only use the major/minor from `FORCE_CUDA_CC`

Based on the test at lines 443-446, which expects the `FORCE_CUDA_CC` arch to be used, option 1 seems intended, so add documentation.
Additional Comments (2)
This means NRT won't benefit from arch-specific features on CC >= 9.0 devices, and could potentially cause compatibility issues when linking together. Fix: apply the same transformation as in `_ensure_cc`:

```python
cc = get_current_device().compute_capability
cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
```

Scenario 1: User calls with a 2-tuple. Scenario 2: User calls with a 3-tuple. This inconsistency causes multiple problems. Fix: normalize ALL cc tuples consistently:

```python
def _ensure_cc(self, cc):
    if cc is not None:
        # Normalize user-provided tuple to 3 elements
        if len(cc) == 2:
            cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
        return cc
    device = devices.get_context().device
    cc = device.compute_capability
    cc = (cc[0], cc[1], "a" if cc >= (9, 0) else "")
    return cc
```
- Add arch specific target support (#549)
- chore: disable `locked` flag to bypass prefix-dev/pixi#5256 (#714)
- ci: relock pixi (#712)
- ci: remove redundant conda build in ci (#711)
- chore(deps): bump numba-cuda version and relock pixi (#707)
- Dropping bits in the old CI & Propagating recent changes from cuda-python (#683)
- Fix `test_wheel_deps_wheels.sh` to actually uninstall `nvvm` and `nvrtc` packages for CUDA 13 (#701)
- perf: remove some exception control flow and buffer-exception penalization for arrays (#700)
- perf: let CAI fall through instead of calling from_cuda_array_interface (#694)
- chore: perf lint (#697)
- chore(deps): bump deps in pixi lockfile (#693)
- fix: use freethreading-supported `_PySet_NextItemRef` where possible (#682)
- Support python `3.14` (#599)
- Remove customized address space tracking and address class emission in debug info (#669)
- Drop `experimental` from cuda.core namespace imports (#676)
- Remove dangling references to NUMBA_CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY (#675)
- Use `rapidsai/sccache` in CI (#674)
- chore(dev-deps): remove ipython and pyinstrument (#670)
- Set up a new VM-based CI infrastructure (#604)
WIP! Supposed to work; still need to add some tests.
Closes: #469