This repository has been archived by the owner on Nov 17, 2023. It is now read-only.
Speed fused_op compilation by caching ptx and jit-compiled device functions #16783
Description
This PR speeds up the dynamic NVRTC compilation of fused ops in response to @rondogency's comment #15167 (comment). As reported in that comment, the total runtime of the three unit tests it mentions had grown drastically, to 17.5 minutes, with fusion enabled. With this PR, the runtime drops to 1 minute, compared with the original 30 seconds with fusion turned off.
Runtime compilation of NVIDIA GPU kernels involves two steps:
- compiling the CUDA code to PTX assembly (performed once per GPU architecture)
- translating the PTX assembly to binary and loading it into a GPU's set of runnable kernels (performed once per GPU device). This latter step produces the CUfunction needed to execute the kernel on the device.
After realizing that the slowed-down unit tests were creating many identical fused ops, I added a cache of the PTX and CUfunctions. For each GPU architecture, the cache maps the CUDA source code to its PTX and to any CUfunctions created from that PTX.
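To illustrate the shape of such a cache, here is a minimal sketch in plain C++. It is not the code from this PR: `KernelCache`, `CacheEntry`, `FakeFunction`, `CompileToPtx`, and `LoadFunction` are hypothetical names, and the two compile steps are replaced by stand-ins so the sketch builds without the CUDA toolkit. In a real implementation the first stand-in would call NVRTC and the second would load the PTX through the CUDA driver API to obtain a `CUfunction`. Thread safety is deliberately left out.

```cpp
#include <map>
#include <string>

// Stand-ins for CUDA types (assumptions for illustration only).
using DeviceId = int;
using FakeFunction = int;  // plays the role of CUfunction

// Everything cached for one (GPU arch, source code) pair.
struct CacheEntry {
  std::string ptx;                             // produced once per arch
  std::map<DeviceId, FakeFunction> functions;  // produced once per device
};

class KernelCache {
 public:
  // Returns the function handle for `source` on device `dev`,
  // running each compile step only if its result is not cached yet.
  FakeFunction Get(int sm_arch, DeviceId dev, const std::string& source) {
    CacheEntry& entry = entries_[sm_arch][source];
    if (entry.ptx.empty()) {
      entry.ptx = CompileToPtx(sm_arch, source);  // step 1: source -> PTX
    }
    auto it = entry.functions.find(dev);
    if (it == entry.functions.end()) {
      // step 2: PTX -> device binary, loaded on this device
      it = entry.functions.emplace(dev, LoadFunction(dev, entry.ptx)).first;
    }
    return it->second;
  }

  int ptx_compiles() const { return ptx_compiles_; }
  int function_loads() const { return function_loads_; }

 private:
  // Stand-in for the NVRTC compile; only counts invocations.
  std::string CompileToPtx(int sm_arch, const std::string& source) {
    ++ptx_compiles_;
    return "// ptx for sm_" + std::to_string(sm_arch) + "\n" + source;
  }

  // Stand-in for the driver-API module load; returns a fresh handle.
  FakeFunction LoadFunction(DeviceId /*dev*/, const std::string& /*ptx*/) {
    ++function_loads_;
    return next_handle_++;
  }

  // arch -> (source code -> cached PTX and per-device functions)
  std::map<int, std::map<std::string, CacheEntry>> entries_;
  int ptx_compiles_ = 0;
  int function_loads_ = 0;
  FakeFunction next_handle_ = 1;
};
```

With this layout, requesting the same source twice on one device triggers neither step the second time, a second device of the same architecture reuses the PTX and only repeats the load step, and a different architecture repeats both.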
It's worth a reminder that the fusion framework is targeting the typical scenario of creating a model's graph and executing it many times. The CI was adversely impacted because it often executes a model's graph just once after creation.