Fix max block size computation in `forall` #744
Conversation
Greptile Summary

This PR fixes a crash in the max block size computation used by `forall`: since PR #609, the wrong object type was passed to the occupancy API, making the calculation fail on certain GPUs.

Key changes: the native occupancy-based block size computation is removed, and the value is now obtained through `cuda-python` APIs from the kernel's `max_threads_per_block` attribute.

Trade-off: the new implementation fixes the crash but changes the semantics of the computation, returning the kernel's hardware maximum rather than an occupancy-optimal block size.

Confidence Score: 3/5
cpcloud left a comment:
Nice!
It looks like this is still in use in cudf. Perhaps we can just fix it as is and keep it around until it can be adjusted downstream in cudf?

This seems like a good compromise.
```diff
-    else:
-        args.append(flags)
-    return driver.cuOccupancyMaxPotentialBlockSizeWithFlags(*args)
+    return func.kernel.attributes.max_threads_per_block()
```
`max_threads_per_block()` returns the maximum supported threads per block for this kernel based on hardware/resource limits, which is different from the optimal block size for occupancy that `cuOccupancyMaxPotentialBlockSize` computes. The original implementation calculated the optimal block size considering dynamic shared memory (`b2d_func`, `memsize`), block size limits (`blocksizelimit`), and flags, all of which are now ignored. This could lead to using the maximum block size (e.g., 1024) even when a smaller size would be more efficient or when resources are constrained.
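To make the distinction concrete, here is a minimal sketch against the `cuda.bindings.driver` API from `cuda-python`. It assumes an already-initialized context and a valid kernel handle `hfunc` (hypothetical here), and that the binding accepts `None` for the unused shared-memory callback:

```python
from cuda.bindings import driver

# Hard ceiling: the largest block this kernel can launch at all, given its
# register/shared-memory footprint. This is what max_threads_per_block()
# reports.
err, hw_max = driver.cuFuncGetAttribute(
    driver.CUfunction_attribute.CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK,
    hfunc,  # hypothetical: a valid kernel handle
)

# Occupancy-optimal size: the block size that maximizes theoretical
# occupancy, given dynamic shared memory (none here) and an optional
# block size limit (0 = no limit). This is what the old code computed.
err, min_grid, best_block = driver.cuOccupancyMaxPotentialBlockSize(
    hfunc,
    None,  # assumed NULL block-size-to-dynamic-smem callback
    0,     # dynamic shared memory per block
    0,     # no block size limit
)

# best_block can sit well below hw_max for resource-hungry kernels.
```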
```diff
+    return (
+        binding.CUresult.CUDA_SUCCESS,
+        func.kernel.attributes.max_threads_per_block(),
+    )
```
Ignores the `b2d_func`, `memsize`, `blocksizelimit`, and `flags` parameters. The previous implementation computed the optimal block size via `cuOccupancyMaxPotentialBlockSize`, taking dynamic shared memory into account. Now it returns the hardware maximum, which may be suboptimal for kernels with significant shared memory usage.
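The tuple shape above mirrors the `(CUresult, value)` convention that `cuda-python` driver calls return, so call sites that unpack an error code keep working. A hedged sketch of that caller pattern, with `query_max_block_size` as a hypothetical stand-in for the patched helper:

```python
from cuda.bindings import driver as binding

def query_max_block_size(func):
    # Hypothetical stand-in for the patched helper: returns the kernel's
    # hardware maximum with a success code, mirroring driver-call shape.
    return (
        binding.CUresult.CUDA_SUCCESS,
        func.kernel.attributes.max_threads_per_block(),
    )

def pick_blocksize(func):
    # Call sites unpack (error, value) exactly as for a raw driver call.
    err, blocksize = query_max_block_size(func)
    if err != binding.CUresult.CUDA_SUCCESS:
        raise RuntimeError(f"max block size query failed: {err}")
    return blocksize
```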
The failing CI jobs are flakes.
- Add Python 3.14 to the wheel publishing matrix (#750)
- feat: swap out internal device array usage with `StridedMemoryView` (#703)
- Fix max block size computation in `forall` (#744)
- Fix prologue debug line info pointing to decorator instead of def line (#746)
- Fix kernel return type in DISubroutineType debug metadata (#745)
- Fix missing line info in Jupyter notebooks (#742)
- Fix: Pass correct flags to linker when debugging in the presence of LTOIR code (#698)
- chore(deps): add cuda-pathfinder to pixi deps (#741)
- fix: enable flake8-bugbear lints and fix found problems (#708)
- fix: Fix race condition in CUDA Simulator (#690)
- ci: run tests in parallel (#740)
- feat: users can pass `shared_memory_carveout` to @cuda.jit (#642)
- Fix compatibility with NumPy 2.4: np.trapz and np.in1d removed (#739)
- Pass the -numba-debug flag to libnvvm (#681)
- ci: remove rapids containers from conda ci (#737)
- Use `pathfinder` for dynamic libraries (#308)
- CI: Add CUDA 13.1 testing support (#705)
- Adding `pixi run test` and `pixi run test-par` support (#724)
- Disable per-PR nvmath tests + follow same test practice (#723)
- chore(deps): regenerate pixi lockfile (#722)
- Fix DISubprogram line number to point to function definition line (#695)
- revert: chore(dev): build pixi using rattler (#713) (#719)
- [feat] Initial version of the Numba CUDA GDB pretty-printer (#692)
- chore(dev): build pixi using rattler (#713)
- build(deps): bump the actions-monthly group across 1 directory with 8 updates (#704)
PR #609 made some changes to the way modules were loaded that result in the wrong object being passed to `cuOccupancyMaxPotentialBlockSize` (previously a `CUFunction` and now a `CUKernel`). This causes the max block size calculation to fail and eventually leads to a `CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES` on certain GPUs. This is observable on a V100 with a resource-hungry kernel.

This PR removes the `numba-cuda` native maximum threads per block computation machinery and routes through `cuda-python` APIs to get the same information.
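For reference, `forall` is the launch helper whose configuration path this PR fixes; a minimal usage sketch (kernel and sizes illustrative) that exercises the automatic block size computation:

```python
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    i = cuda.grid(1)
    if i < out.size:
        out[i] = a * x[i] + y[i]

n = 1 << 20
x = np.ones(n, dtype=np.float32)
y = np.full(n, 2.0, dtype=np.float32)
out = np.zeros_like(x)

# forall(n) chooses the grid and block size automatically; the block
# size comes from the computation this PR repairs.
saxpy.forall(n)(np.float32(2.0), x, y, out)
```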