Allow Inspection of Link-Time Optimized PTX by isVoid · Pull Request #326 · NVIDIA/numba-cuda

isVoid · 2025-07-16T22:58:07Z

Adds the kernel method `inspect_lto_ptx`, which allows inspection of link-time optimized PTX. This method is only supported when `lto=True` is specified for the kernel.
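For illustration, a minimal sketch of how the new method might be called. Only the method name `inspect_lto_ptx` and the `lto=True` requirement come from this PR. The helper name `lto_ptx_or_none`, the toy kernel `scale`, the no-argument call, and the assumption that the result is either PTX text or a mapping of signatures to PTX text (by analogy with `inspect_asm`) are not. The helper returns `None` on machines without a GPU, nvjitlink, or a new enough numba-cuda.

```python
def lto_ptx_or_none():
    """Return LTO PTX for a toy kernel, or None if the environment can't produce it.

    Hypothetical helper for illustration; not part of numba-cuda.
    """
    try:
        from numba import cuda
    except ImportError:
        return None  # numba / numba-cuda is not installed

    try:
        if not cuda.is_available():
            return None  # no usable CUDA device or driver

        # Passing a signature compiles eagerly; lto=True is required by
        # inspect_lto_ptx according to the PR description.
        @cuda.jit("void(float32[:])", lto=True)
        def scale(x):
            i = cuda.grid(1)
            if i < x.size:
                x[i] *= 2.0

        if not hasattr(scale, "inspect_lto_ptx"):
            return None  # numba-cuda predates this PR

        # Assumption: callable without arguments, like inspect_asm.
        return scale.inspect_lto_ptx()
    except Exception:
        # Treat any other failure (e.g. nvjitlink not found) as "unavailable".
        return None


if __name__ == "__main__":
    ptx = lto_ptx_or_none()
    if ptx is None:
        print("LTO PTX inspection is not available in this environment")
    elif isinstance(ptx, dict):
        # Assumed shape: one entry per compiled signature.
        for sig, text in ptx.items():
            print(sig)
            print(text)
    else:
        print(ptx)
```

The broad `except` is there only so the sketch degrades gracefully; real code would likely let the error raised for a missing nvjitlink propagate.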

copy-pr-bot · 2025-07-16T22:58:10Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

isVoid · 2025-07-17T22:39:16Z

/ok to test 758372c

copy-pr-bot · 2025-07-17T22:39:23Z

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

isVoid · 2025-07-17T22:57:33Z

/ok to test 75f24d5

isVoid · 2025-07-21T17:05:12Z

/ok to test 62c197e

numba_cuda/numba/cuda/testing.py

numba_cuda/numba/cuda/dispatcher.py

numba_cuda/numba/cuda/tests/cudapy/test_inspect.py

gmarkall

Many thanks - this looks good... Just a couple of minor nits / questions on the diff.

isVoid · 2025-07-22T15:46:00Z

/ok to test 59e4084

@isVoid

…onversion. This obviates the need for a much more expensive separate PTX compilation step just to get the same information. This reduces the overhead of a given primitive call by over half. (`test_block_exchange.py` went from 1m 23s to about 33s with this change in place, for example.) N.B. Depends on a very recent numba-cuda change by @isVoid: NVIDIA/numba-cuda#326. I don't think a branch has been cut with this change yet, so... we'll need to wait for that before we can pin an appropriate version and test this in CI.

- [NFC] FileCheck tests check all overloads (NVIDIA#354)
- [REVIEW][NFC] Vendor in serialize to allow for future CUDA-specific refactoring and changes (NVIDIA#349)
- Vendor in usecases used in testing (NVIDIA#359)
- Add thirdparty tests of numba extensions (NVIDIA#348)
- Support running tests in parallel (NVIDIA#350)
- Add more debuginfo tests (NVIDIA#358)
- [REVIEW][NFC] Vendor in the Cache, CacheImpl used by CUDACache and CUDACacheImpl to allow for future CUDA-specific refactoring and changes (NVIDIA#334)
- [NFC] Vendor in Dispatcher as CUDADispatcher to allow for future CUDA-specific customization (NVIDIA#338)
- Vendor in BaseNativeLowering and BaseLower for CUDA-specific customizations (NVIDIA#329)
- [REVIEW] Vendor in the CompilerBase used by CUDACompiler to allow for future CUDA-specific refactoring and changes (NVIDIA#322)
- Vendor in Codegen and CodeLibrary for CUDA-specific customization (NVIDIA#327)
- Disable tests that deadlock due to NVIDIA#317 (NVIDIA#356)
- FIX: Add type check for shape elements in DeviceNDArrayBase constructor (NVIDIA#352)
- Merge pull request NVIDIA#265 from lakshayg/fp16-support
- Add performance warning
- Fix tests
- Create and register low++ bindings for float16
- Create typing/target registries for float16
- Replace Numbast generated lower_casts
- Replace Numbast generated operators
- Alias __half to numba.core.types.float16
- Generate fp16 bindings using numbast
- Remove existing fp16 logic
- [REVIEW][NFC] Vendor in the utils and cgutils to allow for future CUDA-specific refactoring and changes (NVIDIA#340)
- [RFC,TESTING] Add filecheck test infrastructure (NVIDIA#342)
- Migrate test infra to pytest (NVIDIA#347)
- Add .vscode to gitignore (NVIDIA#344)
- [NFC] Add dev dependencies to project config (NVIDIA#341)
- Allow Inspection of Link-Time Optimized PTX (NVIDIA#326)
- [NFC] Vendor in DIBuilder used by CUDADIBuilder (NVIDIA#332)
- Add guidance on setting up pre-commit (NVIDIA#339)
- [Refactor][NFC] Vendor in MinimalCallConv (NVIDIA#333)
- [Refactor][NFC] Vendor in BaseCallConv (NVIDIA#324)
- [REVIEW] Vendor in CompileResult as CUDACompileResult to allow for future CUDA-specific customizations (NVIDIA#325)

initial pass on external ptxes

29e0a4c

use -ptx flag for nvjitlink to produce final link-time optimized ptx

f0bef36

isVoid changed the title ~~Draft: Allow Inspection of External PTXes~~ Allow Inspection of Link-Time Optimized PTX Jul 17, 2025

fix codegen

758372c

isVoid marked this pull request as ready for review July 17, 2025 22:39

cross platform write temp file

75f24d5

gmarkall added the "2 - In Progress" label Jul 18, 2025

lakshayg mentioned this pull request Jul 18, 2025

Use Numbast generated float16 bindings #265

Merged

6 tasks

pass in pointer to circumvent a problem relating to fp16, raise error if nvjitlink is not found

62c197e