Allow Inspection of Link-Time Optimized PTX #326

Merged
isVoid merged 10 commits into NVIDIA:main from isVoid:fea-ext-ptx-inspect
Jul 22, 2025

@isVoid isVoid commented Jul 16, 2025

Adds the kernel method `inspect_lto_ptx`, which allows inspection of link-time optimized PTX. The method is supported only when `lto=True` is specified for the kernel.
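
A minimal sketch of how the new method might be used, for illustration only. The PR description confirms the method name `inspect_lto_ptx` and the `lto=True` requirement; everything else here is an assumption: that calling it with no arguments returns a mapping from signature to PTX text (as `inspect_asm` does), that a kernel must be compiled by a launch first, and the hypothetical helper `ptx_entry_names` and kernel `axpy`. Running `demo()` needs a numba-cuda build containing this change, a CUDA-capable GPU, and a toolkit that supports LTO.

```python
def ptx_entry_names(ptx):
    """Return the names of `.entry` kernels declared in a PTX listing.

    Hypothetical helper for illustration; it is not part of numba-cuda.
    """
    names = []
    for line in ptx.splitlines():
        # Separate the name from the opening parenthesis of the parameter list.
        tokens = line.replace("(", " ").split()
        if ".entry" in tokens:
            idx = tokens.index(".entry")
            if idx + 1 < len(tokens):
                names.append(tokens[idx + 1])
    return names


def demo():
    # Imported lazily so the helper above stays usable without a GPU.
    import numpy as np
    from numba import cuda

    # inspect_lto_ptx is only supported when lto=True is set on the kernel.
    @cuda.jit(lto=True)
    def axpy(r, a, x, y):
        i = cuda.grid(1)
        if i < r.size:
            r[i] = a * x[i] + y[i]

    x = np.arange(16, dtype=np.float32)
    r = np.zeros_like(x)
    # ASSUMPTION: launch once so that a compiled overload exists to inspect.
    axpy[1, 16](r, np.float32(2.0), x, x)

    # ASSUMPTION: with no arguments the result maps signature -> PTX text;
    # a single string or bytes object is also handled defensively.
    result = axpy.inspect_lto_ptx()
    listings = result.values() if isinstance(result, dict) else [result]
    for ptx in listings:
        if isinstance(ptx, bytes):
            ptx = ptx.decode()
        print(ptx_entry_names(ptx))
```

Calling `demo()` on a suitable machine should print the entry-point names found in the link-time optimized PTX; the exact mangled names depend on the numba-cuda version.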

copy-pr-bot bot commented Jul 16, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@isVoid isVoid changed the title from "Draft: Allow Inspection of External PTXes" to "Allow Inspection of Link-Time Optimized PTX" Jul 17, 2025

isVoid commented Jul 17, 2025

/ok to test 758372c

@isVoid isVoid marked this pull request as ready for review July 17, 2025 22:39

copy-pr-bot bot commented Jul 17, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

isVoid commented Jul 17, 2025

/ok to test 75f24d5

@gmarkall gmarkall added the 2 - In Progress (Currently a work in progress) label Jul 18, 2025

isVoid commented Jul 21, 2025

/ok to test 62c197e

@gmarkall gmarkall left a comment

Many thanks - this looks good... Just a couple of minor nits / questions on the diff.

@gmarkall gmarkall added 4 - Waiting on author (Waiting for author to respond to review) and removed 2 - In Progress (Currently a work in progress) labels Jul 22, 2025

isVoid commented Jul 22, 2025

/ok to test 59e4084

@isVoid isVoid requested a review from gmarkall July 22, 2025 15:53
@gmarkall gmarkall added 4 - Waiting on CI (Waiting for a CI run to finish successfully) and removed 4 - Waiting on author (Waiting for author to respond to review) labels Jul 22, 2025
@isVoid isVoid merged commit 01e1eb3 into NVIDIA:main Jul 22, 2025
39 checks passed
tpn added a commit to tpn/cccl that referenced this pull request Jul 24, 2025
…onversion.

This obviates the need for a much more expensive separate PTX compilation
step just to get the same information.  This reduces the overhead of a
given primitive call by over half.  (`test_block_exchange.py` went from
1m 23s to about 33s with this change in place, for example.)

N.B. Depends on a very recent numba-cuda change by @isVoid:
     NVIDIA/numba-cuda#326.  I don't think a
     branch has been cut with this change yet, so... we'll need to wait
     for that before we can pin an appropriate version and test this in
     CI.
@isVoid isVoid added 5 - Ready to merge (Testing and reviews complete, ready to merge) and removed 4 - Waiting on CI (Waiting for a CI run to finish successfully) labels Jul 24, 2025
tpn added a commit to tpn/cccl that referenced this pull request Jul 25, 2025
gmarkall added a commit to gmarkall/numba-cuda that referenced this pull request Jul 31, 2025
- [NFC] FileCheck tests check all overloads (NVIDIA#354)
- [REVIEW][NFC] Vendor in serialize to allow for future CUDA-specific refactoring and changes (NVIDIA#349)
- Vendor in usecases used in testing (NVIDIA#359)
- Add thirdparty tests of numba extensions (NVIDIA#348)
- Support running tests in parallel (NVIDIA#350)
- Add more debuginfo tests (NVIDIA#358)
- [REVIEW][NFC] Vendor in the Cache, CacheImpl used by CUDACache and CUDACacheImpl to allow for future CUDA-specific refactoring and changes (NVIDIA#334)
- [NFC] Vendor in Dispatcher as CUDADispatcher to allow for future CUDA-specific customization (NVIDIA#338)
- Vendor in BaseNativeLowering and BaseLower for CUDA-specific customizations (NVIDIA#329)
- [REVIEW] Vendor in the CompilerBase used by CUDACompiler to allow for future CUDA-specific refactoring and changes (NVIDIA#322)
- Vendor in Codegen and CodeLibrary for CUDA-specific customization (NVIDIA#327)
- Disable tests that deadlock due to NVIDIA#317 (NVIDIA#356)
- FIX: Add type check for shape elements in DeviceNDArrayBase constructor (NVIDIA#352)
- Merge pull request NVIDIA#265 from lakshayg/fp16-support
- Add performance warning
- Fix tests
- Create and register low++ bindings for float16
- Create typing/target registries for float16
- Replace Numbast generated lower_casts
- Replace Numbast generated operators
- Alias __half to numba.core.types.float16
- Generate fp16 bindings using numbast
- Remove existing fp16 logic
- [REVIEW][NFC] Vendor in the utils and cgutils to allow for future CUDA-specific refactoring and changes (NVIDIA#340)
- [RFC,TESTING] Add filecheck test infrastructure (NVIDIA#342)
- Migrate test infra to pytest (NVIDIA#347)
- Add .vscode to gitignore (NVIDIA#344)
- [NFC] Add dev dependencies to project config (NVIDIA#341)
- Allow Inspection of Link-Time Optimized PTX (NVIDIA#326)
- [NFC] Vendor in DIBuilder used by CUDADIBuilder (NVIDIA#332)
- Add guidance on setting up pre-commit (NVIDIA#339)
- [Refactor][NFC] Vendor in MinimalCallConv (NVIDIA#333)
- [Refactor][NFC] Vendor in BaseCallConv (NVIDIA#324)
- [REVIEW] Vendor in CompileResult as CUDACompileResult to allow for future CUDA-specific customizations (NVIDIA#325)
@gmarkall gmarkall mentioned this pull request Jul 31, 2025
gmarkall added a commit that referenced this pull request Jul 31, 2025
copy-pr-bot bot pushed a commit to NVIDIA/cccl that referenced this pull request Sep 2, 2025