Migrate numba-cuda driver to use cuda.core.launch API #609
Conversation
/ok to test 8f72cb7
Greptile Overview

Greptile Summary

This PR migrates numba-cuda's kernel launching infrastructure from the legacy `launch_kernel` API to `cuda.core.experimental.launch`.
Confidence Score: 4/5
Important Files Changed / File Analysis
2 files reviewed, 1 comment
/ok to test 015a37c

/ok to test aa92ce5
Force-pushed from aa92ce5 to b95207c (Compare)
/ok to test b95207c

/ok to test 7917bb1

/ok to test fa14321
1 file reviewed, no comments
It would be good to run the launch benchmarks to quantify this change (edit: …)
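For reference, a launch-overhead micro-benchmark along these lines might look like the sketch below. This is not the repository's actual benchmark suite; the trivial `one_arg` kernel and the iteration count are illustrative (they mirror the snippets later in this thread).

```python
import time

import numpy as np
from numba import cuda


@cuda.jit
def one_arg(arr):
    # Trivial body so the timing is dominated by launch overhead.
    i = cuda.grid(1)
    if i < arr.size:
        arr[i] += 1.0


def bench_launch_overhead(n_iters=10_000):
    arr = cuda.device_array(1, dtype=np.float32)
    func = one_arg[1, 1]   # configure grid/block once
    func(arr)              # warm-up: JIT compile and first launch
    cuda.synchronize()

    start = time.perf_counter()
    for _ in range(n_iters):
        func(arr)
    cuda.synchronize()
    elapsed = time.perf_counter() - start
    print(f"{elapsed / n_iters * 1e6:.2f} us per launch")


if __name__ == "__main__":
    bench_launch_overhead()
```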
1 file reviewed, no comments
I'm having issues getting the CuPy benchmarks to run successfully, but here is what I can report now.
/ok to test

/ok to test 9421787
3 files reviewed, no comments
After some more investigation, it looks like the stream conversion is where the remaining overhead is going. Is caching the numba Stream -> cuda.core Stream conversion a viable option? I will also look into whether we can move the conversion higher up in the loop.
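As a rough illustration of the caching idea (not the code in this PR), the conversion result could be memoized on the numba stream object so repeated launches on the same stream reuse one `cuda.core` `Stream`. The attribute name and handle extraction below are assumptions.

```python
from cuda.core.experimental import Stream


def _to_core_stream_cached(numba_stream):
    # Reuse a previously created cuda.core Stream if this numba stream
    # has already been converted once.
    core = getattr(numba_stream, "_core_stream", None)
    if core is None:
        raw = numba_stream.handle
        # The handle may be a ctypes pointer or a plain int depending on
        # the driver binding in use; normalize to an int for from_handle().
        handle = int(getattr(raw, "value", raw) or 0)
        core = Stream.from_handle(handle)
        numba_stream._core_stream = core
    return core
```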
It looks like streams are already being cached as part of the …
Ok, moving the core stream conversion into the `_LaunchConfiguration` constructor …
Naturally, that breaks pickling.
Pushed up a possible implementation of `__getstate__`/`__setstate__` for pickle support.
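For context, a minimal sketch of what such pickle support could look like: drop the live `cuda.core` stream from the pickled state and rebuild it on unpickle. The constructor signature and attribute names are illustrative, not necessarily what the PR implements, and the sketch assumes the `_to_core_stream` helper from `driver.py` is in scope (see #648 for the open questions about what pickling a stream should even mean).

```python
class _LaunchConfiguration:
    def __init__(self, dispatcher, griddim, blockdim, stream, sharedmem):
        self.dispatcher = dispatcher
        self.griddim = griddim
        self.blockdim = blockdim
        self.stream = stream
        self.sharedmem = sharedmem
        # Convert once here so the per-launch path skips the conversion.
        # _to_core_stream is the stream-conversion helper added in driver.py.
        self._core_stream = _to_core_stream(stream)

    def __getstate__(self):
        state = self.__dict__.copy()
        # The cuda.core Stream wraps a live driver handle and cannot be
        # serialized; drop it and recreate it after unpickling.
        state.pop("_core_stream", None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._core_stream = _to_core_stream(self.stream)
```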
Force-pushed from fca2332 to 83fb41a (Compare)
/ok to test
9 files reviewed, no comments
One thing that also may have gotten lost in my sea of comments is that there'll be some slight improvements from the next release of …
/ok to test
Just to confirm, we aren't able to reliably reproduce the 30%+ performance regression once we eliminated the overhead from the stream creation helper?
9 files reviewed, no comments
That is correct. There's still overhead to it, but it's not being invoked on every single launch call. It happens when you slice into the function to get the grid and block dimensions.

Before:

```python
arr = cuda.device_array(...)
func = one_arg[1, 1]
for _ in range(HUGE_NUMBER):
    func(arr)  # _to_core_stream invoked here
```

After:

```python
arr = cuda.device_array(...)
func = one_arg[1, 1]  # _to_core_stream invoked here
for _ in range(HUGE_NUMBER):
    func(arr)
```
kkraus14 left a comment:
LGTM
v0.23.0

- Capture global device arrays in kernels and device functions (NVIDIA#666)
- Fix NVIDIA#624: Accept Numba IR nodes in all places Numba-CUDA IR nodes are expected (NVIDIA#643)
- Fix Issue NVIDIA#588: separate compilation of NVVM IR modules when generating debuginfo (NVIDIA#591)
- feat: allow printing nested tuples (NVIDIA#667)
- build(deps): bump actions/setup-python from 5.6.0 to 6.1.0 (NVIDIA#655)
- build(deps): bump actions/upload-artifact from 4 to 5 (NVIDIA#652)
- Test RAPIDS 25.12 (NVIDIA#661)
- Do not manually set DUMP_ASSEMBLY in `nvjitlink` tests (NVIDIA#662)
- feat: add print support for int64 tuples (NVIDIA#663)
- Only run dependabot monthly and open fewer PRs (NVIDIA#658)
- test: fix bogus `self` argument to `Context` (NVIDIA#656)
- Fix false negative NRT link decision when NRT was previously toggled on (NVIDIA#650)
- Add support for dependabot (NVIDIA#647)
- refactor: cull dead linker objects (NVIDIA#649)
- Migrate numba-cuda driver to use cuda.core.launch API (NVIDIA#609)
- feat: add set_shared_memory_carveout (NVIDIA#629)
- chore: bump version in pixi.toml (NVIDIA#641)
- refactor: remove devicearray code to reduce complexity (NVIDIA#600)
PR #609 made some changes to the way modules were loaded that result in the wrong object being passed to `cuOccupancyMaxPotentialBlockSize` (previously a `CUfunction` and now a `CUkernel`). This causes the max block size calculation to fail after eventually getting the wrong object and leads to a `CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES` on certain GPUs. This is observable on a V100 with a resource-hungry kernel:

```
python -m numba.runtests numba.cuda.tests.cudapy.test_gufunc.TestCUDAGufunc.test_gufunc_small
```

```
cuda.core._utils.cuda_utils.CUDAError: CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES: This indicates that a launch did not occur because it did not have appropriate resources.
```

This PR removes the `numba-cuda` native maximum threads per block computation machinery and routes through `cuda-python` APIs to get the same information.
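As a rough sketch of the direction described above (not the actual patch), the occupancy query can go through the cuda-python driver bindings, converting the `CUkernel` obtained from the library-loading APIs back into a `CUfunction` first. The exact binding signatures below are assumptions and should be checked against the installed cuda-python version.

```python
from cuda.bindings import driver


def _check(err):
    if err != driver.CUresult.CUDA_SUCCESS:
        raise RuntimeError(f"CUDA driver error: {err}")


def max_potential_block_size(kernel, dynamic_smem_bytes=0):
    # cuLibrary*-based loading yields a CUkernel, but
    # cuOccupancyMaxPotentialBlockSize expects a CUfunction.
    err, func = driver.cuKernelGetFunction(kernel)
    _check(err)

    err, min_grid_size, block_size = driver.cuOccupancyMaxPotentialBlockSize(
        func,
        None,                 # no block-size-to-dynamic-smem callback
        dynamic_smem_bytes,   # fixed dynamic shared memory per block
        0,                    # 0 = no upper limit on the block size
    )
    _check(err)
    return block_size
```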



PR #609: Migrate kernel launch to cuda.core
TL;DR: Migrates numba-cuda's kernel launching from the legacy `launch_kernel` API to `cuda.core.experimental.launch`. Performance concerns addressed through optimization work. Tests pass. Removes old abstractions.

What Changed
Core Files Modified:
- `driver.py` - major surgery here
  - Removed `launch_kernel()`, the abstract `Module`/`Function` base classes, `CtypesModule`, and `CtypesFunction`
  - Added a `_to_core_stream()` helper for stream conversion
  - `CudaPythonModule` now wraps `ObjectCode` directly
  - Uses `cuLibraryGetGlobal`/`cuLibraryUnload` instead of `cuModuleGetGlobal`/`cuModuleUnload`
  - Uses `kernel.attributes` instead of CUDA driver calls
- `dispatcher.py` - launch infrastructure rewrite
  - Uses `LaunchConfig` and `cuda.core.experimental.launch()`
  - Stream conversion moved into `_LaunchConfiguration.__init__`
  - Added `__getstate__`/`__setstate__` for pickle support, but see "figure out exactly what pickling a stream means" (#648)
- Tests - updated everywhere
  - `CudaAPIError` → `CUDAError`
  - `ObjectCode` instead of raw handles
  - experimental `Stream.from_handle()` or `_to_core_stream()`
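To make the shape of the new path concrete, here is a minimal sketch of a launch through `cuda.core`, assuming the `LaunchConfig` fields and the `launch(stream, config, kernel, *args)` calling convention referenced in this summary; the wrapper name is illustrative.

```python
from cuda.core.experimental import LaunchConfig, launch


def launch_with_core(kernel, core_stream, griddim, blockdim, sharedmem, *args):
    # Grid/block dimensions and dynamic shared memory are captured in a
    # LaunchConfig instead of being passed to a hand-rolled launch_kernel().
    config = LaunchConfig(grid=griddim, block=blockdim, shmem_size=sharedmem)
    launch(core_stream, config, kernel, *args)
```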
Performance Impact

Initial concern: Early benchmarks showed a 30-40% regression on single-arg cases.
Resolution: Addressed through optimization work in recent commits:
- Stream conversion hoisted out of the per-launch path (leaving just argument preparation via `prepare_args`) to minimize overhead