[AMD] Handle denorms properly for exp2 and exp by zhanglx13 · Pull Request #3816 · triton-lang/triton

zhanglx13 · 2024-05-02T04:58:56Z

This PR enables denorm flushing for tl.math.exp2 and preserves denorms for tl.math.exp, which match their behaviors on Nvidia backend.

More specifically,

denorm flushing for tl.math.exp2 with f32 inputs is controlled by __CUDA_FTZ or __HIP_FTZ and the default is set to flushing denorm. These flags can be set by developers, but are not exposed as kernel argument.

tl.math.exp2(f32)	NV	NV	AMD	AMD
control flag	__CUDA_FTZ=1 (default)	__CUDA_FTZ=0	__HIP_FTZ=1 (default)	__HIP_FTZ=0
device lib	__nv_exp2f	__nv_exp2f
llvm intrinsics	llvm.nvvm.ex2.approx.ftz.f	llvm.nvvm.ex2.approx.f	llvm.amdgcn.exp2.f32	llvm.exp2.f32
ptx	ex2.approx.ftz.f32	ex2.approx.f32
sass/amdgcn	MUFU.EX2	MUFU.EX2 and instructions to check and adjust for denorms	v_exp_f32	v_exp_f32 and instructions to check and adjust for denorms

denorms are preserved for tl.math.exp2 with f64 inputs

tl.math.exp2(f64)	NV	AMD
device lib	__nv_exp2	__ocml_exp2_f64

denorms are preserved for tl.math.exp with both f32 and f64 inputs. Note that tl.math.exp(f32) on nv path is lowered with inline ptx directly without the .ftz flag.

tl.math.exp(f32)	NV	AMD
llvm intrinsics		llvm.exp2.f32
ptx	ex2.approx.f32

tl.math.exp(f64)	NV	AMD
device lib	__nv_exp	__ocml_exp_f64

arsenm · 2024-05-03T15:22:57Z

+  //    of the value of kernel arg `allow_flush_denorm`.
+  // 2. If __HIP_FTZ = 0, whether exp2 flushes denorms in input and output
+  //    depends on the value of kernel arg `allow_flush_denorm`.
+  // 3. __HIP_FTZ is default to 1 and not exposed as a kernel argument.


At least for clang, this is subtarget dependent. Anything gfx9+ defaults to not flushing f32

I set the default to flushing to match the behavior on the NV path.

Did you check with different SM version targets? I think they also switch the default at some level

From NV libdevice, they use __CUDA_FTZ to control whether .ftz should be used. I checked the ptx and sass on A100. I can check H100 to make sure this is also the case there.

zhanglx13 · 2024-05-03T18:04:17Z

Converting to draft after offline discussion with Matt. We may need to figure out a way to have good perf without "breaking" the exp/exp2 ops.

arsenm · 2024-05-03T18:11:31Z

The global __HIP_FTZ/__CUDA_FTZ should imply the global FP mode setting. If those are enabled you should be able to set denormal-fp-math-f32=preserve-sign,preserve-sign on every function and the backend will handle this. You only need additional complexity if you need to contextually ignore denormal handling with standard IEEE handling

antiagainst

Nice, thanks to really digging into all the details about this, @zhanglx13! Much appreciated. I have some comments inlined around a few issues. Aside from them, could you also update the language doc for exp2 to be clear on the denorm flushing behavior?

zhanglx13 · 2024-05-27T03:02:11Z

@antiagainst Re adding doc for these functions. If you are referring to the official doc here, I'm wondering if we want to add denorm handling information for ALL the math functions at once? If so, it may take longer, since we haven't figured out the behavior for all the functions yet.

antiagainst

Nice, thanks! Just two more nits. LGTM.

antiagainst · 2024-05-29T03:34:35Z

+    // For non-FP32 input, call __ocml_exp2_f64 for higher-precision calculation
+    if (elemTy.getIntOrFloatBitWidth() != 32)
+      return {};
+


Here worth a comment saying the former flushes denorm values and the later expands to LLVM instructions to handle denorm values + the former.

antiagainst · 2024-05-29T03:35:39Z

+  // later pass will call __ocml_exp_f64 for higher-precision calculation
  patterns.add<ExpOpConversionApprox>(typeConverter, axisInfoAnalysis, benefit);
+  // Exp2OpConversion will use llvm.exp2.f32 or llvm.amdgcn.exp2.f32
+  // based on the __HIP_FTZ flag if the input type is FP32. For FP64 input,


Replace __HIP_FTZ with ftz here--ftz is a parameter to the function so it's clear what we are referring to.

antiagainst · 2024-05-29T03:38:49Z

@antiagainst Re adding doc for these functions. If you are referring to the official doc here, I'm wondering if we want to add denorm handling information for ALL the math functions at once? If so, it may take longer, since we haven't figured out the behavior for all the functions yet.

I think we can just do exp/exp2 for now. It's fine to have it as a follow up pull request though.

zhanglx13 · 2024-05-29T04:43:47Z

@antiagainst Re adding doc for these functions. If you are referring to the official doc here, I'm wondering if we want to add denorm handling information for ALL the math functions at once? If so, it may take longer, since we haven't figured out the behavior for all the functions yet.

I think we can just do exp/exp2 for now. It's fine to have it as a follow up pull request though.

I'll do it in a follow up PR with an update for all the math function in the doc.

arsenm

How is this supposed tested? I don't see any lit tests

arsenm · 2024-05-29T11:04:28Z

+        ##    depends on the value of kernel arg `allow_flush_denorm`.
+        ## 3. __HIP_FTZ is default to 1 and not exposed as a kernel argument.
+        ##    For now it is used as a controller for developers only.
+        __HIP_FTZ = True


Having this be specifically control exp makes no sense. The name, by lack of exp, and similarity to the module level __CUDA_FTZ, would imply this is changing the global floating point environment denormal mode. There should be no special case modes for a specific operation

The plan is to use this __HIP_FTZ to control denorm handling behavior for all related math functions. This PR serves as a first step and selects exp2 because it affects performance of FA kernels.
__CUDA_FTZ is used only in the libdevice.bc. We don't have the DAZ_OPT flag in AMD ocml.bc anymore, so I exposed this flag here.

I propose splitting it this way:

Fix this to start directly emitting the llvm intrinsic directly

Holistically change the mode to use FTZ, then you don't need to special case anything

Fix this to start directly emitting the llvm intrinsic directly

In this PR, exp2 and exp are lowered with llvm intrinsics directly for f32 inputs. For f64 inputs, I assume we cannot use llvm intrinsics, right? In follow up PR, we can fix other math functions by using llvm intrinsics directly.

Holistically change the mode to use FTZ, then you don't need to special case anything

Do you mean we should just use denorm-fp-math-f32 to control whether denorms are flushed for exp2, as with all other valu operations? Unfortunately, what we are trying to achieve here is to only flush denorms for exp2, which is the case on the nvidia side.

In this PR, exp2 and exp are lowered with llvm intrinsics directly for f32 inputs. For f64 inputs, I assume we cannot use llvm intrinsics, right? In follow up PR, we can fix other math functions by using llvm intrinsics directly.

Correct

Do you mean we should just use denorm-fp-math-f32 to control whether denorms are flushed for exp2, as with all other valu operations?

More precisely, this changes the default floating point mode.

Unfortunately, what we are trying to achieve here is to only flush denorms for exp2, which is the case on the nvidia side.

Sounds like an Nvidia bug to me, unless this is a specific "fast" exp function. You shouldn't be touching any global module flag for this

Triton math op semantics are sort of "defined" by the nvidia instructions there given the history. The overall goal is to figure out the fine details for various ops (a lot there) and document them properly and make sure we are consistent. So I'd expect that's a lenghty procedure and we might not get everything perfect in one go. I think these are good points to follow up on that aren't blocking.

zhanglx13 · 2024-05-29T16:27:43Z

How is this supposed tested? I don't see any lit tests

I don't see any lit tests for denorm handling on the NV side. So I assume we don't need to check for instructions to handle denorms.
And this PR is aiming to improve the performance of FA kernels. And we have verified offline.

arsenm · 2024-05-30T14:57:12Z

How is this supposed tested? I don't see any lit tests

I don't see any lit tests for denorm handling on the NV side.

Bad test coverage elsewhere isn't a good reason to not have tests. Any compiler codegen change should have lit testing

So I assume we don't need to check for instructions to handle denorms.

There is an output change, it should be tested

zhanglx13 · 2024-05-31T15:51:15Z

@arsenm I added lit tests for exp2 and exp to make sure the correct llvm intrinsics are used in different mode.
Regarding the design of using __HIP_FTZ, what do you think @antiagainst?

antiagainst · 2024-06-05T15:17:28Z

@arsenm I added lit tests for exp2 and exp to make sure the correct llvm intrinsics are used in different mode. Regarding the design of using __HIP_FTZ, what do you think @antiagainst?

Thanks for the tests! Regarding __HIP_FTZ I think it's fine for now--as I commented in one inline thread #3816 (comment) we are still figuring out all the fine details of various ops regarding their fp math semantics; so there can be some back and forth. __HIP_FTZ is internal to the AMD backend here and it's well documented. So it won't be exposed to normal triton users and for folks touching the AMD backend it should also be clear for now what it is for. We can iterate on this with more math ops covered later.

… it is not needed (#4161) These need to be in this order to apply cleanly. Cherry pick #3816 and #3790 for release/3.0.x. --------- Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com> Co-authored-by: Lixun Zhang <Lixun.Zhang@amd.com>

zhanglx13 force-pushed the exp2_ftz branch 3 times, most recently from 21f3f15 to 66231b2 Compare May 2, 2024 19:41

zhanglx13 changed the title ~~[WIP] [AMD] Enable denorm flushing for exp2~~ [AMD] Handle denorms properly for exp2 and exp May 3, 2024

zhanglx13 marked this pull request as ready for review May 3, 2024 02:56

zhanglx13 requested a review from antiagainst as a code owner May 3, 2024 02:56

zhanglx13 requested a review from zahimoud May 3, 2024 02:56

arsenm reviewed May 3, 2024

View reviewed changes

zhanglx13 marked this pull request as draft May 3, 2024 18:01

antiagainst requested changes May 11, 2024

View reviewed changes

zhanglx13 force-pushed the exp2_ftz branch 2 times, most recently from ee196f4 to 19fe675 Compare May 27, 2024 02:53

antiagainst approved these changes May 29, 2024

View reviewed changes

antiagainst marked this pull request as ready for review May 29, 2024 03:37

arsenm reviewed May 29, 2024

View reviewed changes

zhanglx13 added 4 commits May 30, 2024 22:06

[AMD] Enable denorm flushing for exp2

817c7ef

Preserve denorm for expOp

33d9971

Replace __ocml_ with llvm instrinsics

32dbc8a

Add comment for llvm.amdgcn.exp2.f32 and llvm.exp2.f32

d46b371

zhanglx13 force-pushed the exp2_ftz branch from ab56605 to d46b371 Compare May 31, 2024 03:06

zhanglx13 requested a review from ptillet as a code owner May 31, 2024 04:10

Add lit test for exp2 and exp codegen

54cb9af

zhanglx13 force-pushed the exp2_ftz branch from af497c9 to 54cb9af Compare May 31, 2024 15:42

zhanglx13 requested review from antiagainst and arsenm May 31, 2024 15:52

Merge branch 'main' into exp2_ftz

ad05f7a

antiagainst approved these changes Jun 5, 2024

View reviewed changes

antiagainst merged commit 05716d4 into triton-lang:main Jun 5, 2024

amjames mentioned this pull request Jun 18, 2024

Cherry picks for 3.0.x release #4159

Closed

amjames mentioned this pull request Jun 18, 2024

[AMD] Move MFMA shortcut check to not compute scratch buffer shape if it is not needed #4161

Merged

Conversation

zhanglx13 commented May 2, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zhanglx13 commented May 3, 2024

Uh oh!

arsenm commented May 3, 2024

Uh oh!

antiagainst left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zhanglx13 commented May 27, 2024

Uh oh!

antiagainst left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

antiagainst commented May 29, 2024

Uh oh!

zhanglx13 commented May 29, 2024

Uh oh!

arsenm left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zhanglx13 commented May 29, 2024

Uh oh!

arsenm commented May 30, 2024

Uh oh!

zhanglx13 commented May 31, 2024

Uh oh!

antiagainst commented Jun 5, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

zhanglx13 commented May 2, 2024 •

edited

Loading