Generic cgemm kernel returns wrong result on AArch64 #729
Comments
OK, C repro here:
#include <cblas.h>
#include <stdio.h>

int main()
{
    /* One complex number each, stored as {real, imag}; A and B are identical. */
    const float A[2] = {0.4713259, 0.14339028};
    const float B[2] = {0.4713259, 0.14339028};
    float C[2];
    const float alpha[2] = {1, 0};
    const float beta[2] = {0, 0};

    /* 1x1 product C = A * B^H = |A|^2, so the imaginary part should be 0. */
    cblas_cgemm(CblasRowMajor, CblasNoTrans, CblasConjTrans,
                1, 1, 1, alpha, A, 1, B, 1, beta, C, 1);
    printf("%g, %g\n", C[0], C[1]);

    /* Same product with column-major storage. */
    cblas_cgemm(CblasColMajor, CblasNoTrans, CblasConjTrans,
                1, 1, 1, alpha, A, 1, B, 1, beta, C, 1);
    printf("%g, %g\n", C[0], C[1]);
    return 0;
}
The output on AArch64 is
and on both x64 and AArch32 it's
|
The relevant assembly generated is
ldp s4, s5, [x3]
ldp s6, s3, [x4]
fmadd s2, s4, s6, s2
fmsub s1, s6, s5, s1
fmadd s2, s5, s3, s2
fmadd s1, s4, s3, s1
Where
I guess the solution to this problem could be to not pass whichever flag allows the compiler to emit FMA instructions? (Or maybe use manually optimized kernels that don't have this problem.) |
Hi @yuyichao, which processor are you testing on? How did you build OpenBLAS? AArch64 has had several patches for incorrect numeric results in the GNU tools, so it could be that you're not enabling the required patches. For instance, the Android cross-compiler didn't have them as of a few months ago. |
The system is ArchLinux ARM. The PKGBUILD I used to compile openblas can be found here. The hardware is Jetson TX1. |
I cannot control the instructions emitted for the generic kernels. Have you tried the latest develop branch? I think it should use the assembly kernel for Cortex-A57. |
Do I need to enable the cortex-a57 kernel manually? With the develop branch the problem is still there. |
What's the name of the library? Is it |
It's armv8. Seems that the README is out of date. I'll test passing appropriate make flag later. |
Not really, both the original test (5x5 matrix) and the reduced test (1x1 matrix) are still failing. |
In the Cortex-A57 kernel, OpenBLAS already uses fmla and fmls to compute the result, as follows.
The Before
After
I have no idea how to fix it. Hardware limit? |
According to the feedback from the ARM engineer, the result |
Sorry, I don't really have a C/Fortran repro yet, but here's a Julia repro.
The output is
but it should be
This doesn't happen for all input values and is also not limited to 1x1 matrices. (The original repro came from the Julia linalg/symmetric test with a 10x5 matrix.) Unless the parameters we used in Julia to call this function are off, the problem should be in OpenBLAS, since I've verified in the debugger that the C function gets the right arguments.
The hardware I tested on has some issues with hardware watchpoints, so I wasn't able to pinpoint the exact line where the wrong value was generated (or at least stored), but this seems to happen (for 1x1 at least) somewhere in the cgemm_kernel_r function defined in kernel/generic/zgemmkernel_2x2.c. This also happens with the reference BLAS, which I assume is using the same or a similar kernel. Disassembly attached in case miscompilation is a worry (compiled with native gcc-5.3).
I understand this is a ULP-level error (for Float32 input), but it seems to happen for too many inputs and shouldn't be this bad for a 1x1 size. This doesn't seem to be an issue on x86 either (probably due to using a different kernel).
Is there a way to compile this generic kernel on x64 for testing?