Cherry picks for 3.0.x release #4159
Closed
amjames wants to merge 8 commits into triton-lang:release/3.0.x from
Conversation
In the current implementation, when backward rematerialization encounters a loop argument that has already been rematerialized, it short-circuits the collection of yield operations, leaving the value in the slice. However, if another loop argument is present in the same slice, the loop is collected again, duplicating the first argument without generating the corresponding yield. To address this, the fix removes values from the slice that are skipped during collection, ensuring they are not duplicated. This keeps the number of yield operands and loop iter_args synchronized.
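The fix can be sketched in Python (illustrative only; `collect_yields` and its arguments are hypothetical stand-ins for the actual pass code):

```python
def collect_yields(slice_values, already_rematerialized):
    """Walk the slice and collect yield operands for loop iter_args.

    The fix: a value skipped because it was already rematerialized is
    removed from the slice, so a later collection of the same loop
    (triggered by another arg in the slice) cannot duplicate it
    without a matching yield.
    """
    yields = []
    for v in list(slice_values):
        if v in already_rematerialized:
            slice_values.remove(v)  # drop the skipped value (the fix)
        else:
            yields.append(v)
    # yield operands and remaining loop iter_args stay in sync
    assert len(yields) == len(slice_values)
    return yields
```

Without the `remove`, a second pass over the loop would still see the stale value in the slice and collect it again.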
These math functions are used by PyTorch Inductor but are missing from the current HIP libdevice list.
… it is not needed (triton-lang#3790)

This PR:
- moves the shortcut check earlier, so the scratch buffer shape is not computed when it is not needed;
- raises the priority of AMD-specific conversions over common ones, to eliminate uncertainty about which pattern is applied;
- adds a regression test for the MFMA to Dot Op shortcut.
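The priority change follows the usual pattern-benefit idea: among matching rewrite patterns, the one with the highest benefit is applied first. A minimal sketch (hypothetical names, not the real rewriter):

```python
# Benefit-ordered pattern selection: the AMD-specific shortcut is given a
# higher benefit than the common scratch-buffer path, so when both match,
# the shortcut deterministically wins.
patterns = [
    (1, "common_mfma_to_dot_via_shared_memory"),  # benefit 1: fallback
    (2, "amd_mfma_to_dot_shortcut"),              # benefit 2: tried first
]

def pick_pattern(candidates):
    # Highest benefit wins; equal benefits would leave the choice
    # unspecified, which is the uncertainty this PR removes.
    return max(candidates, key=lambda p: p[0])[1]
```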
This PR enables denorm flushing for `tl.math.exp2` and preserves denorms for `tl.math.exp`, matching their behaviors on the Nvidia backend. More specifically:

- Denorm flushing for `tl.math.exp2` with f32 inputs is controlled by `__CUDA_FTZ` or `__HIP_FTZ`, and the default is to flush denorms. These flags can be set by developers, but are not exposed as a kernel argument.

tl.math.exp2(f32) | NV | NV | AMD | AMD
-- | -- | -- | -- | --
control flag | __CUDA_FTZ=1 (default) | __CUDA_FTZ=0 | __HIP_FTZ=1 (default) | __HIP_FTZ=0
device lib | __nv_exp2f | __nv_exp2f | |
llvm intrinsics | llvm.nvvm.ex2.approx.ftz.f | llvm.nvvm.ex2.approx.f | llvm.amdgcn.exp2.f32 | llvm.exp2.f32
ptx | ex2.approx.ftz.f32 | ex2.approx.f32 | |
sass/amdgcn | MUFU.EX2 | MUFU.EX2 plus instructions to check and adjust for denorms | v_exp_f32 | v_exp_f32 plus instructions to check and adjust for denorms

- Denorms are preserved for `tl.math.exp2` with f64 inputs.

tl.math.exp2(f64) | NV | AMD
-- | -- | --
device lib | __nv_exp2 | __ocml_exp2_f64

- Denorms are preserved for `tl.math.exp` with both f32 and f64 inputs. Note that `tl.math.exp(f32)` on the NV path is lowered with inline ptx directly, without the `.ftz` flag.

tl.math.exp(f32) | NV | AMD
-- | -- | --
llvm intrinsics | | llvm.exp2.f32
ptx | ex2.approx.f32 |

tl.math.exp(f64) | NV | AMD
-- | -- | --
device lib | __nv_exp | __ocml_exp_f64
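What flushing versus preserving means for an f32 result can be shown on the CPU with pure Python (illustrative only; the actual GPU behavior depends on the FTZ flags above):

```python
import struct

# 2**-140 is below the smallest normal f32 (2**-126 ~ 1.18e-38), so it
# can only be stored in an f32 as a subnormal. Round-tripping through an
# f32 via struct keeps it; FTZ=1 hardware would flush it to 0.0 instead.
exact = 2.0 ** -140
as_f32 = struct.unpack("f", struct.pack("f", exact))[0]
smallest_normal_f32 = 2.0 ** -126

preserved = 0.0 < as_f32 < smallest_normal_f32  # denorm kept (no FTZ)
flushed = 0.0                                   # what FTZ=1 produces
```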
The TritonGPUPipeline pass has unused pass options and the TritonGPUAccelerateMatmul pass option could instead be read from the module attributes, where the data already exists. The goal is to reduce redundancy. --------- Signed-off-by: Finlay Marno <finlay.marno@codeplay.com>
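The "read it from module attributes" idea can be sketched like this (Python stand-in for the MLIR module; the attribute key is hypothetical, not the actual one used by the pass):

```python
# Hypothetical stand-in for an MLIR module that carries attributes.
class Module:
    def __init__(self, attrs):
        self.attrs = attrs

def accelerate_matmul(module, default_cap=80):
    # Read the value from the module instead of a redundant pass option:
    # the data already lives on the module, so there is a single source
    # of truth and no option/attribute mismatch.
    return module.attrs.get("compute-capability", default_cap)
```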
This will enable prefetching for mma-v2 dots on H100. --------- Co-authored-by: Manman Ren <mren@fb.com>
…patibility (triton-lang#4049)

The dictionary merge operator was introduced in Python 3.9, and unfortunately PyTorch still supports 3.8. I think for this use case there is no downside to unpacking, other than that it's a bit ugly.
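For reference, the two spellings side by side (the 3.9+ merge operator versus the 3.8-compatible unpacking):

```python
a = {"x": 1}
b = {"y": 2, "x": 3}

# Python >= 3.9 only:  merged = a | b
# Python 3.8-compatible equivalent; in both spellings the right-hand
# operand wins on duplicate keys:
merged = {**a, **b}
```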
Contributor
Author
I will break this up into individual commits / dependent groups
Cherry pick the following for release 3.0.x: