@cpcloud
Contributor

@cpcloud cpcloud commented Jan 7, 2026

Remove some smaller overheads from kernel launch. The gains are modest, between 5% and 8%, but they are reproducible.

@copy-pr-bot

copy-pr-bot bot commented Jan 7, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cpcloud cpcloud changed the title from "small perf improvements" to "perf: remove some exception control flow and buffer-exception penalization for arrays" Jan 7, 2026
@cpcloud cpcloud requested a review from gmarkall January 7, 2026 15:37
@greptile-apps
Contributor

greptile-apps bot commented Jan 7, 2026

Greptile Summary

This PR optimizes kernel launch overhead through cleaner code patterns that avoid exception handling:

  • numpy_support.py: Replaces try/except KeyError with dict.get() and try/except AttributeError with getattr() using the walrus operator. These changes eliminate exception overhead while maintaining identical functionality.

  • typeof.py: Reorders type checking to prioritize __cuda_array_interface__ before the buffer protocol check. This avoids expensive memoryview creation and exception handling for CuPy arrays and other CUDA-aware arrays.

Both changes follow the principle of avoiding exceptions for control flow, which is a Python performance best practice. The logic remains functionally equivalent with no behavioral changes.
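
A minimal sketch of the two idioms the summary describes, not the actual code from numpy_support.py. The table `_TYPE_NAMES`, the attribute `dtype_name`, and all function and class names are hypothetical and exist only for illustration. Note that the `.get()` and `getattr()` forms match the exception-based forms only when `None` is never a legitimate stored value.

```python
# Hypothetical lookup table; NOT the real mapping used by numba-cuda.
_TYPE_NAMES = {"int32": "i4", "float64": "f8"}


def lookup_with_exception(key):
    """Exception-based control flow: a miss raises and catches KeyError."""
    try:
        return _TYPE_NAMES[key]
    except KeyError:
        return None


def lookup_with_get(key):
    """Same result on a miss, without raising an exception."""
    return _TYPE_NAMES.get(key)


class WithAttr:
    dtype_name = "float64"  # hypothetical attribute name


class WithoutAttr:
    pass


def attr_with_exception(obj):
    """Exception-based attribute probe."""
    try:
        name = obj.dtype_name
    except AttributeError:
        return None
    return lookup_with_get(name)


def attr_with_getattr(obj):
    """getattr() with a default, bound via the walrus operator."""
    if (name := getattr(obj, "dtype_name", None)) is not None:
        return lookup_with_get(name)
    return None
```

Both variants return the same values for hits and misses in this sketch. The second form avoids constructing and unwinding an exception on the miss path, which is the path that matters when misses are common.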

Confidence Score: 5/5

  • This PR is safe to merge with no identified risks
  • The changes are pure performance optimizations that maintain identical behavior. Exception handling is replaced with standard Python idioms (.get(), getattr()), and type check reordering preserves logical correctness while improving efficiency for common cases (CuPy arrays). No functional changes or edge case issues identified.
  • No files require special attention

Important Files Changed

Filename | Overview
numba_cuda/numba/cuda/np/numpy_support.py | Replaced exception-based control flow with dict.get() and getattr() for cleaner, faster lookups. Changes are safe and idiomatic.
numba_cuda/numba/cuda/typing/typeof.py | Moved buffer protocol check after __cuda_array_interface__ check to avoid expensive memoryview creation for CuPy arrays. Maintains correctness while improving performance.
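
The reordering described for typeof.py can be illustrated with a toy classifier. This is a sketch under stated assumptions, not the code in the PR: `FakeDeviceArray`, `classify_buffer_first`, and `classify_cai_first` are hypothetical names, and the returned strings stand in for whatever the real type-resolution code produces.

```python
class FakeDeviceArray:
    """Hypothetical stand-in for a CUDA-aware array such as a CuPy array.

    It exposes __cuda_array_interface__ but not the buffer protocol.
    """

    __cuda_array_interface__ = {
        "shape": (4,),
        "typestr": "<f4",
        "data": (0, False),
        "version": 3,
    }


def classify_buffer_first(obj):
    """Old order: try the buffer protocol, then fall back to the CAI check."""
    try:
        memoryview(obj)  # raises TypeError for objects without a buffer
    except TypeError:
        pass
    else:
        return "buffer"
    if getattr(obj, "__cuda_array_interface__", None) is not None:
        return "cuda_array"
    return "unknown"


def classify_cai_first(obj):
    """New order: a cheap attribute check first, buffer protocol second."""
    if getattr(obj, "__cuda_array_interface__", None) is not None:
        return "cuda_array"
    try:
        memoryview(obj)
    except TypeError:
        return "unknown"
    return "buffer"
```

For an object like `FakeDeviceArray`, the first function pays for a failed `memoryview()` call and a caught `TypeError` before reaching the attribute check, while the second returns after a single `getattr()`. In this toy version, an object that supported both protocols would be classified differently by the two orderings.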

@kkraus14
Contributor

kkraus14 commented Jan 7, 2026

/ok to test

@kkraus14 kkraus14 enabled auto-merge (squash) January 7, 2026 16:02
@kkraus14 kkraus14 merged commit 459b8c0 into NVIDIA:main Jan 7, 2026
118 of 119 checks passed
@cpcloud cpcloud deleted the small-perf-improvements branch January 7, 2026 16:39
gmarkall added a commit to gmarkall/numba-cuda that referenced this pull request Jan 12, 2026
- Add arch specific target support (NVIDIA#549)
- chore: disable `locked` flag to bypass prefix-dev/pixi#5256 (NVIDIA#714)
- ci: relock pixi (NVIDIA#712)
- ci: remove redundant conda build in ci (NVIDIA#711)
- chore(deps): bump numba-cuda version and relock pixi (NVIDIA#707)
- Dropping bits in the old CI & Propagating recent changes from cuda-python (NVIDIA#683)
- Fix `test_wheel_deps_wheels.sh` to actually uninstall `nvvm` and `nvrtc` packages for CUDA 13 (NVIDIA#701)
- perf: remove some exception control flow and buffer-exception penalization for arrays (NVIDIA#700)
- perf: let CAI fall through instead of calling from_cuda_array_interface (NVIDIA#694)
- chore: perf lint (NVIDIA#697)
- chore(deps): bump deps in pixi lockfile (NVIDIA#693)
- fix: use freethreading-supported `_PySet_NextItemRef` where possible (NVIDIA#682)
- Support python `3.14` (NVIDIA#599)
- Remove customized address space tracking and address class emission in debug info (NVIDIA#669)
- Drop `experimental` from cuda.core namespace imports (NVIDIA#676)
- Remove dangling references to NUMBA_CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY (NVIDIA#675)
- Use `rapidsai/sccache` in CI (NVIDIA#674)
- chore(dev-deps): remove ipython and pyinstrument (NVIDIA#670)
- Set up a new VM-based CI infrastructure (NVIDIA#604)
@gmarkall gmarkall mentioned this pull request Jan 12, 2026
gmarkall added a commit that referenced this pull request Jan 12, 2026