[libclc] replace float remquo with amd ocml implementation#177131
[libclc] replace float remquo with amd ocml implementation#177131
Conversation
Current implementation has two issues: * unconditionally soft flushes denormal. * can't pass OpenCL CTS test "test_bruteforce remquo" on intel gpu. This PR upstreams remquo implementation from https://github.com/ROCm/llvm-project/tree/amd-staging/amd/device-libs/ocml/src/remainderF_base.h It supports denormal and can pass OpenCL CTS test. Note __oclc_finite_only_opt is set to false as there is no dynamic dispatching for generic implementation. Number of LLVM IR instructions of function _Z6remquoffPU3AS5i increased from 96 to 678.
There was a problem hiding this comment.
Pull request overview
This PR replaces the current float remquo implementation with AMD's OCML version to fix denormal handling and OpenCL CTS test failures. The new implementation supports denormal numbers properly and passes the "test_bruteforce remquo" test on Intel GPUs.
Changes:
- Replaced the existing remquo algorithm with AMD OCML implementation from ROCm
- Removed soft flushing of denormals that was causing issues
- Added support for proper denormal handling and edge cases
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| libclc/clc/lib/generic/math/clc_remquo.inc | Complete rewrite of remquo implementation using AMD OCML algorithm with denormal support |
| libclc/clc/lib/generic/math/clc_remquo.cl | Updated includes to support new implementation (added fabs, copysign, frexp, nan, native_recip, rint, isfinite, isnan) |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ex = ({ | ||
| int _exp; | ||
| __clc_frexp(ax, &_exp); | ||
| _exp; | ||
| }) - | ||
| 1; | ||
| ax = __clc_ldexp(({ | ||
| int _exp; | ||
| __clc_frexp(ax, &_exp); | ||
| }), | ||
| bits); | ||
| ey = ({ | ||
| int _exp; | ||
| __clc_frexp(ay, &_exp); | ||
| _exp; | ||
| }) - | ||
| 1; | ||
| ay = __clc_ldexp(({ | ||
| int _exp; | ||
| __clc_frexp(ay, &_exp); | ||
| }), | ||
| 1); |
There was a problem hiding this comment.
The frexp function is called twice on ax with identical logic but only the exponent is used the first time and only the mantissa the second time. This is inefficient and duplicative. Consider storing both the mantissa and exponent from a single frexp call and reusing them.
| ex = ({ | |
| int _exp; | |
| __clc_frexp(ax, &_exp); | |
| _exp; | |
| }) - | |
| 1; | |
| ax = __clc_ldexp(({ | |
| int _exp; | |
| __clc_frexp(ax, &_exp); | |
| }), | |
| bits); | |
| ey = ({ | |
| int _exp; | |
| __clc_frexp(ay, &_exp); | |
| _exp; | |
| }) - | |
| 1; | |
| ay = __clc_ldexp(({ | |
| int _exp; | |
| __clc_frexp(ay, &_exp); | |
| }), | |
| 1); | |
| int _exp_ax; | |
| float mant_ax = __clc_frexp(ax, &_exp_ax); | |
| ex = _exp_ax - 1; | |
| ax = __clc_ldexp(mant_ax, bits); | |
| int _exp_ay; | |
| float mant_ay = __clc_frexp(ay, &_exp_ay); | |
| ey = _exp_ay - 1; | |
| ay = __clc_ldexp(mant_ay, 1); |
| bool __oclc_finite_only_opt = false; | ||
| if (!__oclc_finite_only_opt) { | ||
| ret = y == 0.0f ? __clc_nan(0) : ret; | ||
| q7 = y == 0.0f ? 0 : q7; | ||
| bool c = !__clc_isnan(y) && __clc_isfinite(x); | ||
| ret = c ? ret : __clc_nan(0); | ||
| q7 = c ? q7 : 0; | ||
| } | ||
|
|
There was a problem hiding this comment.
The variable name __oclc_finite_only_opt uses a naming convention that suggests it's a configuration constant or macro, but it's declared as a local boolean variable that's always false. This is confusing and misleading. Consider renaming it to something like finite_only_disabled or removing it entirely if it's meant to be a temporary placeholder.
| bool __oclc_finite_only_opt = false; | |
| if (!__oclc_finite_only_opt) { | |
| ret = y == 0.0f ? __clc_nan(0) : ret; | |
| q7 = y == 0.0f ? 0 : q7; | |
| bool c = !__clc_isnan(y) && __clc_isfinite(x); | |
| ret = c ? ret : __clc_nan(0); | |
| q7 = c ? q7 : 0; | |
| } | |
| ret = y == 0.0f ? __clc_nan(0) : ret; | |
| q7 = y == 0.0f ? 0 : q7; | |
| bool c = !__clc_isnan(y) && __clc_isfinite(x); | |
| ret = c ? ret : __clc_nan(0); | |
| q7 = c ? q7 : 0; |
| int qsgn = 1 + (((__clc_as_int(x) ^ __clc_as_int(y)) >> 31) << 1); | ||
| float t = __clc_fma(y, -(float)qsgn, x); | ||
| ret = c ? t | ||
| : (__builtin_isfpclass(__builtin_canonicalizef(0x1p-149f), 0x0040) |
There was a problem hiding this comment.
This should definitely not be inlining the DAZ_OPT hack. Either preserve it, or unconditionally canonicalize
There was a problem hiding this comment.
done, changed to unconditionally canonicalize
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
You can test this locally with the following command:git-clang-format --diff origin/main HEAD --extensions cl,inc -- libclc/clc/lib/generic/math/clc_remquo.cl libclc/clc/lib/generic/math/clc_remquo.inc --diff_from_common_commit
View the diff from clang-format here.diff --git a/libclc/clc/lib/generic/math/clc_remquo.inc b/libclc/clc/lib/generic/math/clc_remquo.inc
index 79eef077b..836dc703c 100644
--- a/libclc/clc/lib/generic/math/clc_remquo.inc
+++ b/libclc/clc/lib/generic/math/clc_remquo.inc
@@ -66,7 +66,7 @@ _CLC_DEF _CLC_OVERLOAD float __clc_remquo(float x, float y,
} else {
ret = x;
q7 = 0;
- bool c = (ay < 0x1.0p+127f & 2.0f * ax > ay) | (ax > 0.5f * ay);
+ bool c = (ay<0x1.0p+127f & 2.0f * ax> ay) | (ax > 0.5f * ay);
int qsgn = 1 + (((__clc_as_int(x) ^ __clc_as_int(y)) >> 31) << 1);
float t = __clc_fma(y, -(float)qsgn, x);
|
Current implementation has two issues: * unconditionally soft flushes denormal. * can't pass OpenCL CTS test "test_bruteforce remquo" on intel gpu. This PR upstreams remquo implementation from https://github.com/ROCm/llvm-project/tree/amd-staging/amd/device-libs/ocml/src/remainderF_base.h It supports denormal and can pass OpenCL CTS test. Number of LLVM IR instructions of function _Z6remquoffPU3AS5i increased from 96 to 680. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Current implementation has two issues: * unconditionally soft flushes denormal. * can't pass OpenCL CTS test "test_bruteforce remquo" on intel gpu. This PR upstreams remquo implementation from https://github.com/ROCm/llvm-project/tree/amd-staging/amd/device-libs/ocml/src/remainderF_base.h It supports denormal and can pass OpenCL CTS test. Number of LLVM IR instructions of function _Z6remquoffPU3AS5i increased from 96 to 680. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
I ran into this issue.
|
Either tell the SPIR-V backend people to support the canonicalize node or put |
|
SPIRV must implement canonicalize |
So should we revert this until that happens? |
revert in #181443 I discussed the issue with Ben Ashbaugh. We can propose a SPIR-V extension to add canonicalize instruction to SPIR-V. What do you think? |
…entation" (#181443) Reverts llvm/llvm-project#177131 It broke SPIRV target: error in backend: unable to legalize instruction: %88:fid(s32) = G_FCANONICALIZE
…lvm#181443) Reverts llvm#177131 It broke SPIRV target: error in backend: unable to legalize instruction: %88:fid(s32) = G_FCANONICALIZE
Current implementation has two issues:
This PR upstreams remquo implementation from
https://github.com/ROCm/llvm-project/tree/amd-staging/amd/device-libs/ocml/src/remainderF_base.h It supports denormal and can pass OpenCL CTS test.
Number of LLVM IR instructions of function _Z6remquoffPU3AS5i increased from 96 to 680.