-
Notifications
You must be signed in to change notification settings - Fork 54
perf: let CAI fall through instead of calling from_cuda_array_interface #694
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
/ok to test |
| handled in numba.cuda.np.arrayobj. | ||
| """ | ||
| # Only handle constants, not arguments (arguments use regular array typing) | ||
| if c.purpose == Purpose.argument: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning None here caused from_cuda_array_interface to be called, which is much more expensive than just executing the rest of this code.
Greptile SummaryThis PR optimizes type inference for third-party device arrays (CuPy, PyTorch) that implement Key changes:
Confidence Score: 5/5
Important Files Changed
|
|
/ok to test |
|
Threw in a commit with some more shape/stride caching, which gives a bit more perf gains across the board relative to just removing the |
|
/ok to test |
- Add arch specific target support (NVIDIA#549) - chore: disable `locked` flag to bypass prefix-dev/pixi#5256 (NVIDIA#714) - ci: relock pixi (NVIDIA#712) - ci: remove redundant conda build in ci (NVIDIA#711) - chore(deps): bump numba-cuda version and relock pixi (NVIDIA#707) - Dropping bits in the old CI & Propagating recent changes from cuda-python (NVIDIA#683) - Fix `test_wheel_deps_wheels.sh` to actually uninstall `nvvm` and `nvrtc` packages for CUDA 13 (NVIDIA#701) - perf: remove some exception control flow and buffer-exception penalization for arrays (NVIDIA#700) - perf: let CAI fall through instead of calling from_cuda_array_interface (NVIDIA#694) - chore: perf lint (NVIDIA#697) - chore(deps): bump deps in pixi lockfile (NVIDIA#693) - fix: use freethreading-supported `_PySet_NextItemRef` where possible (NVIDIA#682) - Support python `3.14` (NVIDIA#599) - Remove customized address space tracking and address class emission in debug info (NVIDIA#669) - Drop `experimental` from cuda.core namespace imports (NVIDIA#676) - Remove dangling references to NUMBA_CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY (NVIDIA#675) - Use `rapidsai/sccache` in CI (NVIDIA#674) - chore(dev-deps): remove ipython and pyinstrument (NVIDIA#670) - Set up a new VM-based CI infrastructure (NVIDIA#604)
- Add arch specific target support (#549) - chore: disable `locked` flag to bypass prefix-dev/pixi#5256 (#714) - ci: relock pixi (#712) - ci: remove redundant conda build in ci (#711) - chore(deps): bump numba-cuda version and relock pixi (#707) - Dropping bits in the old CI & Propagating recent changes from cuda-python (#683) - Fix `test_wheel_deps_wheels.sh` to actually uninstall `nvvm` and `nvrtc` packages for CUDA 13 (#701) - perf: remove some exception control flow and buffer-exception penalization for arrays (#700) - perf: let CAI fall through instead of calling from_cuda_array_interface (#694) - chore: perf lint (#697) - chore(deps): bump deps in pixi lockfile (#693) - fix: use freethreading-supported `_PySet_NextItemRef` where possible (#682) - Support python `3.14` (#599) - Remove customized address space tracking and address class emission in debug info (#669) - Drop `experimental` from cuda.core namespace imports (#676) - Remove dangling references to NUMBA_CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY (#675) - Use `rapidsai/sccache` in CI (#674) - chore(dev-deps): remove ipython and pyinstrument (#670) - Set up a new VM-based CI infrastructure (#604)

Remove
from_cuda_array_interfacecall when inferring type from cupy and torch arrays. Speed up mainly benefits dispatching calls, and is upwards of ~2.5x in some cases.