Allow Inspection of Link-Time Optimized PTX #326
Merged
isVoid merged 10 commits into NVIDIA:main on Jul 22, 2025
Contributor
Author
/ok to test 758372c
Contributor
Author
/ok to test 75f24d5
6 tasks
… if nvjitlink is not found
Contributor
Author
/ok to test 62c197e
gmarkall
reviewed
Jul 21, 2025
gmarkall
requested changes
Jul 21, 2025
Contributor
gmarkall
left a comment
Many thanks - this looks good... Just a couple of minor nits / questions on the diff.
Contributor
Author
/ok to test 59e4084
gmarkall
approved these changes
Jul 22, 2025
tpn
added a commit
to tpn/cccl
that referenced
this pull request
Jul 24, 2025
…onversion. This obviates the need for a much more expensive separate PTX compilation step just to get the same information. This reduces the overhead of a given primitive call by over half. (`test_block_exchange.py` went from 1m 23s to about 33s with this change in place, for example.) N.B. Depends on a very recent numba-cuda change by @isVoid: NVIDIA/numba-cuda#326. I don't think a branch has been cut with this change yet, so... we'll need to wait for that before we can pin an appropriate version and test this in CI.
tpn
added a commit
to tpn/cccl
that referenced
this pull request
Jul 25, 2025
gmarkall
added a commit
to gmarkall/numba-cuda
that referenced
this pull request
Jul 31, 2025
- [NFC] FileCheck tests check all overloads (NVIDIA#354)
- [REVIEW][NFC] Vendor in serialize to allow for future CUDA-specific refactoring and changes (NVIDIA#349)
- Vendor in usecases used in testing (NVIDIA#359)
- Add thirdparty tests of numba extensions (NVIDIA#348)
- Support running tests in parallel (NVIDIA#350)
- Add more debuginfo tests (NVIDIA#358)
- [REVIEW][NFC] Vendor in the Cache, CacheImpl used by CUDACache and CUDACacheImpl to allow for future CUDA-specific refactoring and changes (NVIDIA#334)
- [NFC] Vendor in Dispatcher as CUDADispatcher to allow for future CUDA-specific customization (NVIDIA#338)
- Vendor in BaseNativeLowering and BaseLower for CUDA-specific customizations (NVIDIA#329)
- [REVIEW] Vendor in the CompilerBase used by CUDACompiler to allow for future CUDA-specific refactoring and changes (NVIDIA#322)
- Vendor in Codegen and CodeLibrary for CUDA-specific customization (NVIDIA#327)
- Disable tests that deadlock due to NVIDIA#317 (NVIDIA#356)
- FIX: Add type check for shape elements in DeviceNDArrayBase constructor (NVIDIA#352)
- Merge pull request NVIDIA#265 from lakshayg/fp16-support
- Add performance warning
- Fix tests
- Create and register low++ bindings for float16
- Create typing/target registries for float16
- Replace Numbast generated lower_casts
- Replace Numbast generated operators
- Alias __half to numba.core.types.float16
- Generate fp16 bindings using numbast
- Remove existing fp16 logic
- [REVIEW][NFC] Vendor in the utils and cgutils to allow for future CUDA-specific refactoring and changes (NVIDIA#340)
- [RFC,TESTING] Add filecheck test infrastructure (NVIDIA#342)
- Migrate test infra to pytest (NVIDIA#347)
- Add .vscode to gitignore (NVIDIA#344)
- [NFC] Add dev dependencies to project config (NVIDIA#341)
- Allow Inspection of Link-Time Optimized PTX (NVIDIA#326)
- [NFC] Vendor in DIBuilder used by CUDADIBuilder (NVIDIA#332)
- Add guidance on setting up pre-commit (NVIDIA#339)
- [Refactor][NFC] Vendor in MinimalCallConv (NVIDIA#333)
- [Refactor][NFC] Vendor in BaseCallConv (NVIDIA#324)
- [REVIEW] Vendor in CompileResult as CUDACompileResult to allow for future CUDA-specific customizations (NVIDIA#325)
gmarkall
added a commit
that referenced
this pull request
Jul 31, 2025
copy-pr-bot bot
pushed a commit
to NVIDIA/cccl
that referenced
this pull request
Sep 2, 2025
Adds kernel method `inspect_lto_ptx`, which allows inspection of link-time optimized PTX. This method is only supported when `lto=True` is specified for the kernel.
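As a rough illustration of the description above, here is a minimal sketch of how the new method might be called. Only the method name `inspect_lto_ptx` and the `lto=True` requirement come from this PR; the kernel `add_one`, the helper `inspect_add_one_lto_ptx`, and the assumption that the method can be called without arguments (in the style of the existing `inspect_asm`) are illustrative and should be checked against the merged diff. It also assumes a CUDA device and a working nvjitlink installation; without numba-cuda or a device the helper simply returns `None`.

```python
def inspect_add_one_lto_ptx():
    """Return what inspect_lto_ptx() reports for a toy kernel, or None if unavailable."""
    try:
        import numpy as np
        from numba import cuda
    except ImportError:
        return None  # numba-cuda (or NumPy) is not installed
    if not cuda.is_available():
        return None  # no usable CUDA device

    # lto=True is required: per the PR description, inspect_lto_ptx is only
    # supported for kernels compiled with link-time optimization enabled.
    @cuda.jit(lto=True)
    def add_one(arr):  # hypothetical example kernel
        i = cuda.grid(1)
        if i < arr.size:
            arr[i] += 1

    inspect = getattr(add_one, "inspect_lto_ptx", None)
    if inspect is None:
        return None  # the installed numba-cuda predates this PR

    # Launch once so that at least one specialization has been compiled.
    d_arr = cuda.to_device(np.zeros(32, dtype=np.float32))
    add_one[1, 32](d_arr)

    # Assumption: callable without arguments, analogous to inspect_asm().
    return inspect()


if __name__ == "__main__":
    print(inspect_add_one_lto_ptx())
```

The `getattr` guard keeps the sketch usable against numba-cuda versions that do not yet include this change, which matters because the PR was not yet part of a release when it was merged.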