Skip to content

[AMD] Handle denorms properly for exp2 and exp#3816

Merged
antiagainst merged 6 commits intotriton-lang:mainfrom
zhanglx13:exp2_ftz
Jun 5, 2024
Merged

[AMD] Handle denorms properly for exp2 and exp#3816
antiagainst merged 6 commits intotriton-lang:mainfrom
zhanglx13:exp2_ftz

Conversation

@zhanglx13
Copy link
Copy Markdown
Collaborator

@zhanglx13 zhanglx13 commented May 2, 2024

This PR enables denorm flushing for tl.math.exp2 and preserves denorms for tl.math.exp, which match their behaviors on Nvidia backend.

More specifically,

  • denorm flushing for tl.math.exp2 with f32 inputs is controlled by __CUDA_FTZ or __HIP_FTZ and the default is set to flushing denorm. These flags can be set by developers, but are not exposed as kernel argument.
tl.math.exp2(f32) NV NV AMD AMD
control flag __CUDA_FTZ=1 (default) __CUDA_FTZ=0 __HIP_FTZ=1 (default) __HIP_FTZ=0
device lib __nv_exp2f __nv_exp2f
llvm intrinsics llvm.nvvm.ex2.approx.ftz.f llvm.nvvm.ex2.approx.f llvm.amdgcn.exp2.f32 llvm.exp2.f32
ptx ex2.approx.ftz.f32 ex2.approx.f32    
sass/amdgcn MUFU.EX2 MUFU.EX2
and instructions to
check and adjust for
denorms
v_exp_f32 v_exp_f32
and instructions
to check and
adjust for
denorms
  • denorms are preserved for tl.math.exp2 with f64 inputs
tl.math.exp2(f64) NV AMD
device lib __nv_exp2 __ocml_exp2_f64
  • denorms are preserved for tl.math.exp with both f32 and f64 inputs. Note that tl.math.exp(f32) on nv path is lowered with inline ptx directly without the .ftz flag.
tl.math.exp(f32) NV AMD
llvm intrinsics   llvm.exp2.f32
ptx ex2.approx.f32  
tl.math.exp(f64) NV AMD
device lib __nv_exp __ocml_exp_f64

@zhanglx13 zhanglx13 force-pushed the exp2_ftz branch 3 times, most recently from 21f3f15 to 66231b2 Compare May 2, 2024 19:41
@zhanglx13 zhanglx13 changed the title [WIP] [AMD] Enable denorm flushing for exp2 [AMD] Handle denorms properly for exp2 and exp May 3, 2024
@zhanglx13 zhanglx13 marked this pull request as ready for review May 3, 2024 02:56
@zhanglx13 zhanglx13 requested a review from antiagainst as a code owner May 3, 2024 02:56
@zhanglx13 zhanglx13 requested a review from zahimoud May 3, 2024 02:56
Comment thread third_party/amd/lib/TritonAMDGPUToLLVM/ElementwiseOpToLLVM.cpp Outdated
Comment thread third_party/amd/lib/TritonAMDGPUToLLVM/ElementwiseOpToLLVM.cpp Outdated
// of the value of kernel arg `allow_flush_denorm`.
// 2. If __HIP_FTZ = 0, whether exp2 flushes denorms in input and output
// depends on the value of kernel arg `allow_flush_denorm`.
// 3. __HIP_FTZ is default to 1 and not exposed as a kernel argument.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least for clang, this is subtarget dependent. Anything gfx9+ defaults to not flushing f32

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I set the default to flushing to match the behavior on the NV path.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you check with different SM version targets? I think they also switch the default at some level

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From NV libdevice, they use __CUDA_FTZ to control whether .ftz should be used. I checked the ptx and sass on A100. I can check H100 to make sure this is also the case there.

@zhanglx13 zhanglx13 marked this pull request as draft May 3, 2024 18:01
@zhanglx13
Copy link
Copy Markdown
Collaborator Author

Converting to draft after offline discussion with Matt. We may need to figure out a way to have good perf without "breaking" the exp/exp2 ops.

@arsenm
Copy link
Copy Markdown

arsenm commented May 3, 2024

The global __HIP_FTZ/__CUDA_FTZ should imply the global FP mode setting. If those are enabled you should be able to set denormal-fp-math-f32=preserve-sign,preserve-sign on every function and the backend will handle this. You only need additional complexity if you need to contextually ignore denormal handling with standard IEEE handling

Copy link
Copy Markdown
Member

@antiagainst antiagainst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks to really digging into all the details about this, @zhanglx13! Much appreciated. I have some comments inlined around a few issues. Aside from them, could you also update the language doc for exp2 to be clear on the denorm flushing behavior?

Comment thread third_party/amd/backend/compiler.py
Comment thread third_party/amd/lib/TritonAMDGPUToLLVM/ElementwiseOpToLLVM.cpp Outdated
Comment thread third_party/amd/lib/TritonAMDGPUToLLVM/ElementwiseOpToLLVM.cpp Outdated
Comment thread third_party/amd/lib/TritonAMDGPUToLLVM/ElementwiseOpToLLVM.cpp
Comment thread third_party/amd/lib/TritonAMDGPUToLLVM/ElementwiseOpToLLVM.cpp
Comment thread third_party/amd/lib/TritonAMDGPUToLLVM/ElementwiseOpToLLVM.cpp
@zhanglx13 zhanglx13 force-pushed the exp2_ftz branch 2 times, most recently from ee196f4 to 19fe675 Compare May 27, 2024 02:53
@zhanglx13
Copy link
Copy Markdown
Collaborator Author

@antiagainst Re adding doc for these functions. If you are referring to the official doc here, I'm wondering if we want to add denorm handling information for ALL the math functions at once? If so, it may take longer, since we haven't figured out the behavior for all the functions yet.

Copy link
Copy Markdown
Member

@antiagainst antiagainst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks! Just two more nits. LGTM.

// For non-FP32 input, call __ocml_exp2_f64 for higher-precision calculation
if (elemTy.getIntOrFloatBitWidth() != 32)
return {};

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here worth a comment saying the former flushes denorm values and the later expands to LLVM instructions to handle denorm values + the former.

// later pass will call __ocml_exp_f64 for higher-precision calculation
patterns.add<ExpOpConversionApprox>(typeConverter, axisInfoAnalysis, benefit);
// Exp2OpConversion will use llvm.exp2.f32 or llvm.amdgcn.exp2.f32
// based on the __HIP_FTZ flag if the input type is FP32. For FP64 input,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace __HIP_FTZ with ftz here--ftz is a parameter to the function so it's clear what we are referring to.

Comment thread third_party/amd/lib/TritonAMDGPUToLLVM/ElementwiseOpToLLVM.cpp
@antiagainst antiagainst marked this pull request as ready for review May 29, 2024 03:37
@antiagainst
Copy link
Copy Markdown
Member

@antiagainst Re adding doc for these functions. If you are referring to the official doc here, I'm wondering if we want to add denorm handling information for ALL the math functions at once? If so, it may take longer, since we haven't figured out the behavior for all the functions yet.

I think we can just do exp/exp2 for now. It's fine to have it as a follow up pull request though.

@zhanglx13
Copy link
Copy Markdown
Collaborator Author

@antiagainst Re adding doc for these functions. If you are referring to the official doc here, I'm wondering if we want to add denorm handling information for ALL the math functions at once? If so, it may take longer, since we haven't figured out the behavior for all the functions yet.

I think we can just do exp/exp2 for now. It's fine to have it as a follow up pull request though.

I'll do it in a follow up PR with an update for all the math function in the doc.

Copy link
Copy Markdown

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this supposed tested? I don't see any lit tests

## depends on the value of kernel arg `allow_flush_denorm`.
## 3. __HIP_FTZ is default to 1 and not exposed as a kernel argument.
## For now it is used as a controller for developers only.
__HIP_FTZ = True
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having this be specifically control exp makes no sense. The name, by lack of exp, and similarity to the module level __CUDA_FTZ, would imply this is changing the global floating point environment denormal mode. There should be no special case modes for a specific operation

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The plan is to use this __HIP_FTZ to control denorm handling behavior for all related math functions. This PR serves as a first step and selects exp2 because it affects performance of FA kernels.
__CUDA_FTZ is used only in the libdevice.bc. We don't have the DAZ_OPT flag in AMD ocml.bc anymore, so I exposed this flag here.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose splitting it this way:

  1. Fix this to start directly emitting the llvm intrinsic directly
  2. Holistically change the mode to use FTZ, then you don't need to special case anything

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Fix this to start directly emitting the llvm intrinsic directly

In this PR, exp2 and exp are lowered with llvm intrinsics directly for f32 inputs. For f64 inputs, I assume we cannot use llvm intrinsics, right? In follow up PR, we can fix other math functions by using llvm intrinsics directly.

  1. Holistically change the mode to use FTZ, then you don't need to special case anything

Do you mean we should just use denorm-fp-math-f32 to control whether denorms are flushed for exp2, as with all other valu operations? Unfortunately, what we are trying to achieve here is to only flush denorms for exp2, which is the case on the nvidia side.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this PR, exp2 and exp are lowered with llvm intrinsics directly for f32 inputs. For f64 inputs, I assume we cannot use llvm intrinsics, right? In follow up PR, we can fix other math functions by using llvm intrinsics directly.

Correct

Do you mean we should just use denorm-fp-math-f32 to control whether denorms are flushed for exp2, as with all other valu operations?

More precisely, this changes the default floating point mode.

Unfortunately, what we are trying to achieve here is to only flush denorms for exp2, which is the case on the nvidia side.

Sounds like an Nvidia bug to me, unless this is a specific "fast" exp function. You shouldn't be touching any global module flag for this

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Triton math op semantics are sort of "defined" by the nvidia instructions there given the history. The overall goal is to figure out the fine details for various ops (a lot there) and document them properly and make sure we are consistent. So I'd expect that's a lenghty procedure and we might not get everything perfect in one go. I think these are good points to follow up on that aren't blocking.

@zhanglx13
Copy link
Copy Markdown
Collaborator Author

How is this supposed tested? I don't see any lit tests

I don't see any lit tests for denorm handling on the NV side. So I assume we don't need to check for instructions to handle denorms.
And this PR is aiming to improve the performance of FA kernels. And we have verified offline.

@arsenm
Copy link
Copy Markdown

arsenm commented May 30, 2024

How is this supposed tested? I don't see any lit tests

I don't see any lit tests for denorm handling on the NV side.

Bad test coverage elsewhere isn't a good reason to not have tests. Any compiler codegen change should have lit testing

So I assume we don't need to check for instructions to handle denorms.

There is an output change, it should be tested

@zhanglx13
Copy link
Copy Markdown
Collaborator Author

@arsenm I added lit tests for exp2 and exp to make sure the correct llvm intrinsics are used in different mode.
Regarding the design of using __HIP_FTZ, what do you think @antiagainst?

@zhanglx13 zhanglx13 requested review from antiagainst and arsenm May 31, 2024 15:52
@antiagainst
Copy link
Copy Markdown
Member

@arsenm I added lit tests for exp2 and exp to make sure the correct llvm intrinsics are used in different mode. Regarding the design of using __HIP_FTZ, what do you think @antiagainst?

Thanks for the tests! Regarding __HIP_FTZ I think it's fine for now--as I commented in one inline thread #3816 (comment) we are still figuring out all the fine details of various ops regarding their fp math semantics; so there can be some back and forth. __HIP_FTZ is internal to the AMD backend here and it's well documented. So it won't be exposed to normal triton users and for folks touching the AMD backend it should also be clear for now what it is for. We can iterate on this with more math ops covered later.

@antiagainst antiagainst merged commit 05716d4 into triton-lang:main Jun 5, 2024
amjames pushed a commit to amjames/triton that referenced this pull request Jun 18, 2024
This PR enables denorm flushing for `tl.math.exp2` and preserves denorms
for `tl.math.exp`, which match their behaviors on Nvidia backend.

More specifically, 
- denorm flushing for tl.math.exp2 with f32 inputs is controlled by
`__CUDA_FTZ` or `__HIP_FTZ` and the default is set to flushing denorm.
These flags can be set by developers, but are not exposed as kernel
argument.

tl.math.exp2(f32) | NV | NV | AMD | AMD
-- | -- | -- | -- | --
control flag | __CUDA_FTZ=1 (default) | __CUDA_FTZ=0 | __HIP_FTZ=1
(default) | __HIP_FTZ=0
device lib | __nv_exp2f | __nv_exp2f |  | 
llvm intrinsics | llvm.nvvm.ex2.approx.ftz.f | llvm.nvvm.ex2.approx.f |
llvm.amdgcn.exp2.f32 | llvm.exp2.f32
ptx | ex2.approx.ftz.f32 | ex2.approx.f32 |   |  
sass/amdgcn | MUFU.EX2 | MUFU.EX2<br>and instructions to<br>check and
adjust for<br>denorms | v_exp_f32 | v_exp_f32<br>and instructions<br>to
check and<br>adjust for<br>denorms
- denorms are preserved for tl.math.exp2 with f64 inputs

tl.math.exp2(f64) | NV | AMD
-- | -- | --
device lib | __nv_exp2 | __ocml_exp2_f64
- denorms are preserved for tl.math.exp with both f32 and f64 inputs.
Note that tl.math.exp(f32) on nv path is lowered with inline ptx
directly without the `.ftz` flag.

tl.math.exp(f32) | NV | AMD
-- | -- | --
llvm intrinsics |   | llvm.exp2.f32
ptx | ex2.approx.f32 |  


tl.math.exp(f64) | NV | AMD
-- | -- | --
device lib | __nv_exp | __ocml_exp_f64
amjames pushed a commit to amjames/triton that referenced this pull request Jun 18, 2024
This PR enables denorm flushing for `tl.math.exp2` and preserves denorms
for `tl.math.exp`, which match their behaviors on Nvidia backend.

More specifically, 
- denorm flushing for tl.math.exp2 with f32 inputs is controlled by
`__CUDA_FTZ` or `__HIP_FTZ` and the default is set to flushing denorm.
These flags can be set by developers, but are not exposed as kernel
argument.

tl.math.exp2(f32) | NV | NV | AMD | AMD
-- | -- | -- | -- | --
control flag | __CUDA_FTZ=1 (default) | __CUDA_FTZ=0 | __HIP_FTZ=1
(default) | __HIP_FTZ=0
device lib | __nv_exp2f | __nv_exp2f |  | 
llvm intrinsics | llvm.nvvm.ex2.approx.ftz.f | llvm.nvvm.ex2.approx.f |
llvm.amdgcn.exp2.f32 | llvm.exp2.f32
ptx | ex2.approx.ftz.f32 | ex2.approx.f32 |   |  
sass/amdgcn | MUFU.EX2 | MUFU.EX2<br>and instructions to<br>check and
adjust for<br>denorms | v_exp_f32 | v_exp_f32<br>and instructions<br>to
check and<br>adjust for<br>denorms
- denorms are preserved for tl.math.exp2 with f64 inputs

tl.math.exp2(f64) | NV | AMD
-- | -- | --
device lib | __nv_exp2 | __ocml_exp2_f64
- denorms are preserved for tl.math.exp with both f32 and f64 inputs.
Note that tl.math.exp(f32) on nv path is lowered with inline ptx
directly without the `.ftz` flag.

tl.math.exp(f32) | NV | AMD
-- | -- | --
llvm intrinsics |   | llvm.exp2.f32
ptx | ex2.approx.f32 |  


tl.math.exp(f64) | NV | AMD
-- | -- | --
device lib | __nv_exp | __ocml_exp_f64
ptillet pushed a commit that referenced this pull request Jun 18, 2024
… it is not needed (#4161)

These need to be in this order to apply cleanly.


Cherry pick #3816 and #3790 for release/3.0.x.

---------

Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
Co-authored-by: Lixun Zhang <Lixun.Zhang@amd.com>
bertmaher pushed a commit to bertmaher/triton that referenced this pull request Dec 10, 2024
This PR enables denorm flushing for `tl.math.exp2` and preserves denorms
for `tl.math.exp`, which match their behaviors on Nvidia backend.

More specifically, 
- denorm flushing for tl.math.exp2 with f32 inputs is controlled by
`__CUDA_FTZ` or `__HIP_FTZ` and the default is set to flushing denorm.
These flags can be set by developers, but are not exposed as kernel
argument.

tl.math.exp2(f32) | NV | NV | AMD | AMD
-- | -- | -- | -- | --
control flag | __CUDA_FTZ=1 (default) | __CUDA_FTZ=0 | __HIP_FTZ=1
(default) | __HIP_FTZ=0
device lib | __nv_exp2f | __nv_exp2f |  | 
llvm intrinsics | llvm.nvvm.ex2.approx.ftz.f | llvm.nvvm.ex2.approx.f |
llvm.amdgcn.exp2.f32 | llvm.exp2.f32
ptx | ex2.approx.ftz.f32 | ex2.approx.f32 |   |  
sass/amdgcn | MUFU.EX2 | MUFU.EX2<br>and instructions to<br>check and
adjust for<br>denorms | v_exp_f32 | v_exp_f32<br>and instructions<br>to
check and<br>adjust for<br>denorms
- denorms are preserved for tl.math.exp2 with f64 inputs

tl.math.exp2(f64) | NV | AMD
-- | -- | --
device lib | __nv_exp2 | __ocml_exp2_f64
- denorms are preserved for tl.math.exp with both f32 and f64 inputs.
Note that tl.math.exp(f32) on nv path is lowered with inline ptx
directly without the `.ftz` flag.

tl.math.exp(f32) | NV | AMD
-- | -- | --
llvm intrinsics |   | llvm.exp2.f32
ptx | ex2.approx.f32 |  


tl.math.exp(f64) | NV | AMD
-- | -- | --
device lib | __nv_exp | __ocml_exp_f64
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants