[AMD] Handle denorms properly for exp2 and exp#3816
[AMD] Handle denorms properly for exp2 and exp#3816antiagainst merged 6 commits intotriton-lang:mainfrom
Conversation
21f3f15 to
66231b2
Compare
| // of the value of kernel arg `allow_flush_denorm`. | ||
| // 2. If __HIP_FTZ = 0, whether exp2 flushes denorms in input and output | ||
| // depends on the value of kernel arg `allow_flush_denorm`. | ||
| // 3. __HIP_FTZ is default to 1 and not exposed as a kernel argument. |
There was a problem hiding this comment.
At least for clang, this is subtarget dependent. Anything gfx9+ defaults to not flushing f32
There was a problem hiding this comment.
I set the default to flushing to match the behavior on the NV path.
There was a problem hiding this comment.
Did you check with different SM version targets? I think they also switch the default at some level
There was a problem hiding this comment.
From NV libdevice, they use __CUDA_FTZ to control whether .ftz should be used. I checked the ptx and sass on A100. I can check H100 to make sure this is also the case there.
|
Converting to draft after offline discussion with Matt. We may need to figure out a way to have good perf without "breaking" the exp/exp2 ops. |
|
The global __HIP_FTZ/__CUDA_FTZ should imply the global FP mode setting. If those are enabled you should be able to set denormal-fp-math-f32=preserve-sign,preserve-sign on every function and the backend will handle this. You only need additional complexity if you need to contextually ignore denormal handling with standard IEEE handling |
antiagainst
left a comment
There was a problem hiding this comment.
Nice, thanks to really digging into all the details about this, @zhanglx13! Much appreciated. I have some comments inlined around a few issues. Aside from them, could you also update the language doc for exp2 to be clear on the denorm flushing behavior?
ee196f4 to
19fe675
Compare
|
@antiagainst Re adding doc for these functions. If you are referring to the official doc here, I'm wondering if we want to add denorm handling information for ALL the math functions at once? If so, it may take longer, since we haven't figured out the behavior for all the functions yet. |
antiagainst
left a comment
There was a problem hiding this comment.
Nice, thanks! Just two more nits. LGTM.
| // For non-FP32 input, call __ocml_exp2_f64 for higher-precision calculation | ||
| if (elemTy.getIntOrFloatBitWidth() != 32) | ||
| return {}; | ||
|
|
There was a problem hiding this comment.
Here worth a comment saying the former flushes denorm values and the later expands to LLVM instructions to handle denorm values + the former.
| // later pass will call __ocml_exp_f64 for higher-precision calculation | ||
| patterns.add<ExpOpConversionApprox>(typeConverter, axisInfoAnalysis, benefit); | ||
| // Exp2OpConversion will use llvm.exp2.f32 or llvm.amdgcn.exp2.f32 | ||
| // based on the __HIP_FTZ flag if the input type is FP32. For FP64 input, |
There was a problem hiding this comment.
Replace __HIP_FTZ with ftz here--ftz is a parameter to the function so it's clear what we are referring to.
I think we can just do exp/exp2 for now. It's fine to have it as a follow up pull request though. |
I'll do it in a follow up PR with an update for all the math function in the doc. |
arsenm
left a comment
There was a problem hiding this comment.
How is this supposed tested? I don't see any lit tests
| ## depends on the value of kernel arg `allow_flush_denorm`. | ||
| ## 3. __HIP_FTZ is default to 1 and not exposed as a kernel argument. | ||
| ## For now it is used as a controller for developers only. | ||
| __HIP_FTZ = True |
There was a problem hiding this comment.
Having this be specifically control exp makes no sense. The name, by lack of exp, and similarity to the module level __CUDA_FTZ, would imply this is changing the global floating point environment denormal mode. There should be no special case modes for a specific operation
There was a problem hiding this comment.
The plan is to use this __HIP_FTZ to control denorm handling behavior for all related math functions. This PR serves as a first step and selects exp2 because it affects performance of FA kernels.
__CUDA_FTZ is used only in the libdevice.bc. We don't have the DAZ_OPT flag in AMD ocml.bc anymore, so I exposed this flag here.
There was a problem hiding this comment.
I propose splitting it this way:
- Fix this to start directly emitting the llvm intrinsic directly
- Holistically change the mode to use FTZ, then you don't need to special case anything
There was a problem hiding this comment.
- Fix this to start directly emitting the llvm intrinsic directly
In this PR, exp2 and exp are lowered with llvm intrinsics directly for f32 inputs. For f64 inputs, I assume we cannot use llvm intrinsics, right? In follow up PR, we can fix other math functions by using llvm intrinsics directly.
- Holistically change the mode to use FTZ, then you don't need to special case anything
Do you mean we should just use denorm-fp-math-f32 to control whether denorms are flushed for exp2, as with all other valu operations? Unfortunately, what we are trying to achieve here is to only flush denorms for exp2, which is the case on the nvidia side.
There was a problem hiding this comment.
In this PR, exp2 and exp are lowered with llvm intrinsics directly for f32 inputs. For f64 inputs, I assume we cannot use llvm intrinsics, right? In follow up PR, we can fix other math functions by using llvm intrinsics directly.
Correct
Do you mean we should just use denorm-fp-math-f32 to control whether denorms are flushed for exp2, as with all other valu operations?
More precisely, this changes the default floating point mode.
Unfortunately, what we are trying to achieve here is to only flush denorms for exp2, which is the case on the nvidia side.
Sounds like an Nvidia bug to me, unless this is a specific "fast" exp function. You shouldn't be touching any global module flag for this
There was a problem hiding this comment.
Triton math op semantics are sort of "defined" by the nvidia instructions there given the history. The overall goal is to figure out the fine details for various ops (a lot there) and document them properly and make sure we are consistent. So I'd expect that's a lenghty procedure and we might not get everything perfect in one go. I think these are good points to follow up on that aren't blocking.
I don't see any lit tests for denorm handling on the NV side. So I assume we don't need to check for instructions to handle denorms. |
Bad test coverage elsewhere isn't a good reason to not have tests. Any compiler codegen change should have lit testing
There is an output change, it should be tested |
|
@arsenm I added lit tests for exp2 and exp to make sure the correct llvm intrinsics are used in different mode. |
Thanks for the tests! Regarding |
This PR enables denorm flushing for `tl.math.exp2` and preserves denorms for `tl.math.exp`, which match their behaviors on Nvidia backend. More specifically, - denorm flushing for tl.math.exp2 with f32 inputs is controlled by `__CUDA_FTZ` or `__HIP_FTZ` and the default is set to flushing denorm. These flags can be set by developers, but are not exposed as kernel argument. tl.math.exp2(f32) | NV | NV | AMD | AMD -- | -- | -- | -- | -- control flag | __CUDA_FTZ=1 (default) | __CUDA_FTZ=0 | __HIP_FTZ=1 (default) | __HIP_FTZ=0 device lib | __nv_exp2f | __nv_exp2f | | llvm intrinsics | llvm.nvvm.ex2.approx.ftz.f | llvm.nvvm.ex2.approx.f | llvm.amdgcn.exp2.f32 | llvm.exp2.f32 ptx | ex2.approx.ftz.f32 | ex2.approx.f32 | | sass/amdgcn | MUFU.EX2 | MUFU.EX2<br>and instructions to<br>check and adjust for<br>denorms | v_exp_f32 | v_exp_f32<br>and instructions<br>to check and<br>adjust for<br>denorms - denorms are preserved for tl.math.exp2 with f64 inputs tl.math.exp2(f64) | NV | AMD -- | -- | -- device lib | __nv_exp2 | __ocml_exp2_f64 - denorms are preserved for tl.math.exp with both f32 and f64 inputs. Note that tl.math.exp(f32) on nv path is lowered with inline ptx directly without the `.ftz` flag. tl.math.exp(f32) | NV | AMD -- | -- | -- llvm intrinsics | | llvm.exp2.f32 ptx | ex2.approx.f32 | tl.math.exp(f64) | NV | AMD -- | -- | -- device lib | __nv_exp | __ocml_exp_f64
This PR enables denorm flushing for `tl.math.exp2` and preserves denorms for `tl.math.exp`, which match their behaviors on Nvidia backend. More specifically, - denorm flushing for tl.math.exp2 with f32 inputs is controlled by `__CUDA_FTZ` or `__HIP_FTZ` and the default is set to flushing denorm. These flags can be set by developers, but are not exposed as kernel argument. tl.math.exp2(f32) | NV | NV | AMD | AMD -- | -- | -- | -- | -- control flag | __CUDA_FTZ=1 (default) | __CUDA_FTZ=0 | __HIP_FTZ=1 (default) | __HIP_FTZ=0 device lib | __nv_exp2f | __nv_exp2f | | llvm intrinsics | llvm.nvvm.ex2.approx.ftz.f | llvm.nvvm.ex2.approx.f | llvm.amdgcn.exp2.f32 | llvm.exp2.f32 ptx | ex2.approx.ftz.f32 | ex2.approx.f32 | | sass/amdgcn | MUFU.EX2 | MUFU.EX2<br>and instructions to<br>check and adjust for<br>denorms | v_exp_f32 | v_exp_f32<br>and instructions<br>to check and<br>adjust for<br>denorms - denorms are preserved for tl.math.exp2 with f64 inputs tl.math.exp2(f64) | NV | AMD -- | -- | -- device lib | __nv_exp2 | __ocml_exp2_f64 - denorms are preserved for tl.math.exp with both f32 and f64 inputs. Note that tl.math.exp(f32) on nv path is lowered with inline ptx directly without the `.ftz` flag. tl.math.exp(f32) | NV | AMD -- | -- | -- llvm intrinsics | | llvm.exp2.f32 ptx | ex2.approx.f32 | tl.math.exp(f64) | NV | AMD -- | -- | -- device lib | __nv_exp | __ocml_exp_f64
This PR enables denorm flushing for `tl.math.exp2` and preserves denorms for `tl.math.exp`, which match their behaviors on Nvidia backend. More specifically, - denorm flushing for tl.math.exp2 with f32 inputs is controlled by `__CUDA_FTZ` or `__HIP_FTZ` and the default is set to flushing denorm. These flags can be set by developers, but are not exposed as kernel argument. tl.math.exp2(f32) | NV | NV | AMD | AMD -- | -- | -- | -- | -- control flag | __CUDA_FTZ=1 (default) | __CUDA_FTZ=0 | __HIP_FTZ=1 (default) | __HIP_FTZ=0 device lib | __nv_exp2f | __nv_exp2f | | llvm intrinsics | llvm.nvvm.ex2.approx.ftz.f | llvm.nvvm.ex2.approx.f | llvm.amdgcn.exp2.f32 | llvm.exp2.f32 ptx | ex2.approx.ftz.f32 | ex2.approx.f32 | | sass/amdgcn | MUFU.EX2 | MUFU.EX2<br>and instructions to<br>check and adjust for<br>denorms | v_exp_f32 | v_exp_f32<br>and instructions<br>to check and<br>adjust for<br>denorms - denorms are preserved for tl.math.exp2 with f64 inputs tl.math.exp2(f64) | NV | AMD -- | -- | -- device lib | __nv_exp2 | __ocml_exp2_f64 - denorms are preserved for tl.math.exp with both f32 and f64 inputs. Note that tl.math.exp(f32) on nv path is lowered with inline ptx directly without the `.ftz` flag. tl.math.exp(f32) | NV | AMD -- | -- | -- llvm intrinsics | | llvm.exp2.f32 ptx | ex2.approx.f32 | tl.math.exp(f64) | NV | AMD -- | -- | -- device lib | __nv_exp | __ocml_exp_f64
This PR enables denorm flushing for
tl.math.exp2and preserves denorms fortl.math.exp, which match their behaviors on Nvidia backend.More specifically,
__CUDA_FTZor__HIP_FTZand the default is set to flushing denorm. These flags can be set by developers, but are not exposed as kernel argument.and instructions to
check and adjust for
denorms
and instructions
to check and
adjust for
denorms
.ftzflag.