
Conversation

@brandon-b-miller
Contributor

PR #609 changed the way modules are loaded, with the result that the wrong object is now passed to cuOccupancyMaxPotentialBlockSize (a CUkernel where previously a CUfunction was used). The max block size calculation therefore receives the wrong object and fails, leading to a CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES on certain GPUs. This is observable on a V100 with a resource-hungry kernel:

```
python -m numba.runtests numba.cuda.tests.cudapy.test_gufunc.TestCUDAGufunc.test_gufunc_small
cuda.core._utils.cuda_utils.CUDAError: CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES: This indicates that a launch did not occur because it did not have appropriate resources.
```

This PR removes numba-cuda's native maximum-threads-per-block computation machinery and instead routes through cuda-python APIs to obtain the same information.
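
For illustration, here is a minimal sketch of the attribute query the new code path relies on (`func.kernel.attributes.max_threads_per_block()` in the diff below), assuming cuda.core.experimental's compile-and-query flow; the saxpy kernel and every name in this snippet are invented for the example and are not numba-cuda internals:

```python
# Minimal sketch, assuming cuda.core.experimental's compile-and-query flow.
# The saxpy kernel and all names here are illustrative, not numba-cuda code.
from cuda.core.experimental import Device, Program, ProgramOptions

dev = Device()
dev.set_current()

code = r"""
extern "C" __global__
void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
"""

prog = Program(code, code_type="c++", options=ProgramOptions(arch="sm_70"))
mod = prog.compile("cubin")
kernel = mod.get_kernel("saxpy")

# Per-kernel hardware/resource limit; this is the value the fixed code path
# now reports, rather than an occupancy-derived "optimal" block size.
print(kernel.attributes.max_threads_per_block())
```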

@copy-pr-bot

copy-pr-bot bot commented Jan 23, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.


@brandon-b-miller
Contributor Author

/ok to test

@greptile-apps
Contributor

greptile-apps bot commented Jan 23, 2026

Greptile Overview

Greptile Summary

This PR fixes a CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES crash introduced in PR #609, where the wrong object type (a CUkernel instead of a CUfunction) was being passed to the CUDA occupancy API. The fix removes the native numba-cuda occupancy calculation machinery and replaces it with a simpler approach that returns the hardware maximum threads per block via func.kernel.attributes.max_threads_per_block().

Key changes:

  • Replaced cuOccupancyMaxPotentialBlockSize call with max_threads_per_block() attribute access
  • Removed helper methods _ctypes_max_potential_block_size and _cuda_python_max_potential_block_size
  • Removed test that validated the old occupancy calculation behavior
  • Removed unused imports (c_size_t, cu_occupancy_b2d_size)

Trade-off: The new implementation fixes the crash but changes the semantics of get_max_potential_block_size(). The method now returns the hardware maximum block size rather than computing an optimal block size based on occupancy, shared memory usage, and resource constraints. All parameters (b2d_func, memsize, blocksizelimit, flags) are now ignored. This is a pragmatic fix that resolves the immediate crash, though it may result in suboptimal block sizes for kernels with significant resource usage.

Confidence Score: 3/5

  • Safe to merge as a pragmatic fix for a critical crash, but with known trade-offs in optimality
  • The PR successfully fixes the CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES crash by avoiding the type mismatch issue. However, the fix changes the behavior from computing optimal block sizes to returning hardware maximums, ignoring occupancy calculations and resource constraints. This is acceptable as a quick fix for the crash but may lead to suboptimal performance for resource-intensive kernels. The test removal is appropriate since it tested the old implementation.
  • Pay attention to numba_cuda/numba/cuda/cudadrv/driver.py - the simplified implementation may need future refinement to restore occupancy-aware block size calculation

Important Files Changed

| Filename | Overview |
|----------|----------|
| numba_cuda/numba/cuda/cudadrv/driver.py | Simplified get_max_potential_block_size to return the hardware maximum instead of computing an optimal block size via the occupancy API. Fixes the crash but changes semantics: ignores shared memory and occupancy parameters. |
| numba_cuda/numba/cuda/tests/cudadrv/test_cuda_driver.py | Removed the test for the old get_max_potential_block_size implementation, which tested the occupancy calculation with a dynamic shared memory callback. |

@greptile-apps bot left a comment

2 files reviewed, 1 comment

@brandon-b-miller
Contributor Author

/ok to test

@cpcloud left a comment

Nice!

@cpcloud
Contributor

cpcloud commented Jan 23, 2026

It looks like this is still in use in cudf. Perhaps we can just fix it as is and keep it around until it can be adjusted downstream in cudf?

@gmarkall
Contributor

> Perhaps we can just fix it as is and keep it around until it can be adjusted downstream in cudf?

This seems like a good compromise.

@brandon-b-miller
Contributor Author

/ok to test

@greptile-apps bot left a comment

1 file reviewed, 1 comment

```diff
-        else:
-            args.append(flags)
-            return driver.cuOccupancyMaxPotentialBlockSizeWithFlags(*args)
+        return func.kernel.attributes.max_threads_per_block()
```

max_threads_per_block() returns the maximum supported threads per block for this kernel based on hardware/resource limits, which is different from the optimal block size for occupancy that cuOccupancyMaxPotentialBlockSize computes. The original implementation calculated the optimal block size considering dynamic shared memory (b2d_func, memsize), block size limits (blocksizelimit), and flags - all of which are now ignored. This could lead to using the maximum block size (e.g., 1024) even when a smaller size would be more efficient or when resources are constrained.
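
For context, here is a hedged sketch of what the occupancy query computes, assuming the cuda-python driver bindings; `cufunc` is a placeholder for a loaded CUfunction handle and is not defined in this snippet:

```python
# Illustrative sketch of the occupancy-based computation, assuming the
# cuda-python driver bindings. `cufunc` stands in for a CUfunction obtained
# from a loaded module; it is not defined here.
from cuda.bindings import driver

# Returns (err, min_grid_size, block_size): the block size that maximizes
# theoretical occupancy for this kernel given its register and shared memory
# usage. For resource-hungry kernels this can be far below the hardware
# maximum that max_threads_per_block() reports.
err, min_grid_size, block_size = driver.cuOccupancyMaxPotentialBlockSize(
    cufunc,
    0,  # blockSizeToDynamicSMemSize callback (b2d_func); 0/NULL assumed here
    0,  # dynamicSMemSize (memsize)
    0,  # blockSizeLimit (0 = no limit)
)
```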

@greptile-apps bot left a comment

1 file reviewed, 1 comment

Comment on lines +1176 to +1179

```python
return (
    binding.CUresult.CUDA_SUCCESS,
    func.kernel.attributes.max_threads_per_block(),
)
```

This ignores the b2d_func, memsize, blocksizelimit, and flags parameters. The previous implementation computed an optimal block size via cuOccupancyMaxPotentialBlockSize, taking dynamic shared memory into account. It now returns the hardware maximum, which may be suboptimal for kernels with significant shared memory usage.
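
A minimal sketch of the backward-compatible shape this implies, assuming the goal is to keep the old signature for downstream callers such as cudf; the parameter names mirror those discussed in this review, but the body is illustrative, not the exact numba-cuda code:

```python
# Minimal sketch of the backward-compatible shim described above. Mirrors the
# (status, value) pair shown in the diff; the success constant is hard-coded
# (CUresult success is 0) to keep the sketch self-contained.
CUDA_SUCCESS = 0

def get_max_potential_block_size(func, b2d_func, memsize, blocksizelimit, flags=None):
    # b2d_func, memsize, blocksizelimit, and flags are accepted so existing
    # callers keep working, but they no longer influence the result.
    return CUDA_SUCCESS, func.kernel.attributes.max_threads_per_block()
```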

@brandon-b-miller
Contributor Author

/ok to test

@greptile-apps bot left a comment

No files reviewed, no comments

@cpcloud
Contributor

cpcloud commented Jan 26, 2026

The failing CI jobs are flakes.

  1. Build win-64 seems like a transient GitHub error.
  2. Test linux-64 is related to IPC timeouts (these are somewhat flaky).
  3. test-conda is hitting #573 ([BUG] Use of weakref.proxy for rtsys._memsys_module appears to be incorrect).

@brandon-b-miller merged commit f0c9320 into NVIDIA:main Jan 26, 2026
190 of 193 checks passed
@brandon-b-miller deleted the fix-max-block-size branch January 26, 2026 16:48
gmarkall added a commit to gmarkall/numba-cuda that referenced this pull request Jan 27, 2026
- Add Python 3.14 to the wheel publishing matrix (NVIDIA#750)
- feat: swap out internal device array usage with `StridedMemoryView` (NVIDIA#703)
- Fix max block size computation in `forall` (NVIDIA#744)
- Fix prologue debug line info pointing to decorator instead of def line (NVIDIA#746)
- Fix kernel return type in DISubroutineType debug metadata (NVIDIA#745)
- Fix missing line info in Jupyter notebooks (NVIDIA#742)
- Fix: Pass correct flags to linker when debugging in the presence of LTOIR code (NVIDIA#698)
- chore(deps): add cuda-pathfinder to pixi deps (NVIDIA#741)
- fix: enable flake8-bugbear lints and fix found problems (NVIDIA#708)
- fix: Fix race condition in CUDA Simulator (NVIDIA#690)
- ci: run tests in parallel (NVIDIA#740)
- feat: users can pass `shared_memory_carveout` to @cuda.jit (NVIDIA#642)
- Fix compatibility with NumPy 2.4: np.trapz and np.in1d removed (NVIDIA#739)
- Pass the -numba-debug flag to libnvvm (NVIDIA#681)
- ci: remove rapids containers from conda ci (NVIDIA#737)
- Use `pathfinder` for dynamic libraries (NVIDIA#308)
- CI: Add CUDA 13.1 testing support (NVIDIA#705)
- Adding `pixi run test` and `pixi run test-par` support (NVIDIA#724)
- Disable per-PR nvmath tests + follow same test practice (NVIDIA#723)
- chore(deps): regenerate pixi lockfile (NVIDIA#722)
- Fix DISubprogram line number to point to function definition line (NVIDIA#695)
- revert: chore(dev): build pixi using rattler (NVIDIA#713) (NVIDIA#719)
- [feat] Initial version of the Numba CUDA GDB pretty-printer (NVIDIA#692)
- chore(dev): build pixi using rattler (NVIDIA#713)
- build(deps): bump the actions-monthly group across 1 directory with 8 updates (NVIDIA#704)
@gmarkall mentioned this pull request Jan 27, 2026
kkraus14 pushed a commit that referenced this pull request Jan 28, 2026
