
Cherry picks for 3.0.x release #4159

Closed
amjames wants to merge 8 commits into triton-lang:release/3.0.x from amjames:cherry-pick-release

Conversation

amjames and others added 8 commits June 17, 2024 22:22
In the current implementation, when backward rematerialization encounters a
loop argument that has already been rematerialized, it short-circuits the
collection of yield operations but leaves the value in the slice. If another
loop argument is present in the same slice, the loop is collected again,
duplicating the first argument without generating the corresponding yield.

To address this, the fix removes skipped values from the slice so they are not
collected a second time, keeping the number of yield operands in sync with the
number of loop iter_args.
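The shape of the fix, sketched as Python pseudocode (the actual pass is C++; every name below is an illustrative stand-in, not Triton's real API):

```python
# Hypothetical sketch of the fixed collection loop; `slice_values` is a set of
# values in the backward slice, `yield_for` maps a value to its yield operand.
def collect_yields(slice_values, already_rematerialized, yield_for):
    yields = []
    for v in list(slice_values):
        if v in already_rematerialized:
            # The fix: a skipped value must also leave the slice; otherwise a
            # second loop argument in the same slice re-collects it without a
            # matching yield, desynchronizing yields and loop iter_args.
            slice_values.discard(v)
            continue
        yields.append(yield_for(v))
    return yields
```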
These math functions are used by PyTorch Inductor but are missing from the
current list of HIP libdevice functions.
… it is not needed (triton-lang#3790)

This PR:
- moves the shortcut check earlier, so the scratch buffer shape is not
computed when it is not needed (see the sketch after this list)
- raises the priority of AMD-specific conversions above the common ones,
removing any ambiguity about which pattern applies
- adds a regression test for the MFMA to Dot Op shortcut
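A minimal sketch of the reordering from the first item, with the check and emit steps passed in as callables (hypothetical names; the real change lives in Triton's AMD layout-conversion lowering):

```python
# Hypothetical sketch: run the cheap shortcut check before doing any work
# toward the general path.
def convert_layout(src, dst, is_shortcut, emit_shortcut, scratch_shape, emit_general):
    if is_shortcut(src, dst):
        # MFMA -> Dot Op shortcut: no scratch buffer, so its shape is never computed.
        return emit_shortcut(src, dst)
    shape = scratch_shape(src, dst)  # computed only when actually needed
    return emit_general(src, dst, shape)
```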
This PR enables denorm flushing for `tl.math.exp2` and preserves denorms for
`tl.math.exp`, matching their behavior on the NVIDIA backend.

More specifically,
- denorm flushing for tl.math.exp2 with f32 inputs is controlled by
`__CUDA_FTZ` or `__HIP_FTZ`, and the default is to flush denorms. These flags
can be set by developers, but they are not exposed as a kernel argument.

| tl.math.exp2(f32) | NV | NV | AMD | AMD |
| -- | -- | -- | -- | -- |
| control flag | __CUDA_FTZ=1 (default) | __CUDA_FTZ=0 | __HIP_FTZ=1 (default) | __HIP_FTZ=0 |
| device lib | __nv_exp2f | __nv_exp2f | | |
| llvm intrinsics | llvm.nvvm.ex2.approx.ftz.f | llvm.nvvm.ex2.approx.f | llvm.amdgcn.exp2.f32 | llvm.exp2.f32 |
| ptx | ex2.approx.ftz.f32 | ex2.approx.f32 | | |
| sass/amdgcn | MUFU.EX2 | MUFU.EX2, plus instructions to check and adjust for denorms | v_exp_f32 | v_exp_f32, plus instructions to check and adjust for denorms |
- denorms are preserved for tl.math.exp2 with f64 inputs

| tl.math.exp2(f64) | NV | AMD |
| -- | -- | -- |
| device lib | __nv_exp2 | __ocml_exp2_f64 |
- denorms are preserved for tl.math.exp with both f32 and f64 inputs. Note
that tl.math.exp(f32) on the NV path is lowered directly to inline PTX
without the `.ftz` flag.

| tl.math.exp(f32) | NV | AMD |
| -- | -- | -- |
| llvm intrinsics | | llvm.exp2.f32 |
| ptx | ex2.approx.f32 | |


| tl.math.exp(f64) | NV | AMD |
| -- | -- | -- |
| device lib | __nv_exp | __ocml_exp_f64 |
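For reference, a minimal kernel exercising both functions (standard Triton API; the comments restate the behavior documented in the tables above):

```python
import triton
import triton.language as tl

@triton.jit
def exp_kernel(x_ptr, out2_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # f32 exp2 flushes denormal results to zero by default
    # (__CUDA_FTZ / __HIP_FTZ default to 1).
    tl.store(out2_ptr + offs, tl.math.exp2(x), mask=mask)
    # exp preserves denorms on both backends.
    tl.store(out_ptr + offs, tl.math.exp(x), mask=mask)
```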
The TritonGPUPipeline pass has unused pass options, and the
TritonGPUAccelerateMatmul pass option could instead be read from the module
attributes, where the data already exists. The goal is to reduce redundancy.

---------

Signed-off-by: Finlay Marno <finlay.marno@codeplay.com>
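A sketch of the direction, assuming a dict-like view of the module attributes (the attribute name below is an assumption modeled on TTGIR's module-level attributes; the real passes do this in C++):

```python
# Hypothetical sketch: read the value where it already lives on the module
# instead of duplicating it as a pass option.
def get_compute_capability(module_attrs, default=80):
    # The attribute name here is an assumption, not necessarily the one the
    # pass actually reads.
    return module_attrs.get("triton_gpu.compute-capability", default)
```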
This will enable prefetching for mma-v2 dots on H100.

---------

Co-authored-by: Manman Ren <mren@fb.com>
…patibility (triton-lang#4049)

The dictionary merge operator (`|`) was introduced in Python 3.9, and
unfortunately PyTorch still supports 3.8. I think for this use case there is
no downside to unpacking, other than that it's a bit ugly.
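For context, the two spellings side by side (the dict contents are just an example):

```python
defaults = {"num_warps": 4}
overrides = {"num_stages": 3}

merged = defaults | overrides        # PEP 584 dict union: Python 3.9+ only
merged = {**defaults, **overrides}   # equivalent unpacking: also works on 3.8
```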

amjames commented Jun 18, 2024

I will break this up into individual commits / dependent groups

amjames closed this Jun 18, 2024