Generic cgemm kernel returns wrong result on AArch64 #729

Closed
yuyichao opened this issue Jan 2, 2016 · 11 comments

@yuyichao
Contributor

yuyichao commented Jan 2, 2016

Sorry, I don't have a C/Fortran repro yet, but here's a Julia repro.

#!/usr/bin/julia -f

@noinline function k(C, A, B)
    ccall((:cgemm_, BLAS.libblas), Void,
          (Ptr{UInt8}, Ptr{UInt8}, Ptr{BLAS.BlasInt}, Ptr{BLAS.BlasInt},
           Ptr{BLAS.BlasInt}, Ptr{Complex64}, Ptr{Complex64}, Ptr{BLAS.BlasInt},
           Ptr{Complex64}, Ptr{BLAS.BlasInt}, Ptr{Complex64}, Ptr{Complex64},
           Ptr{BLAS.BlasInt}),
          &'N', &'C', &1, &1,
          &1, &one(Complex64), A, &1,
          B, &1, &zero(Complex64), C, &1)
end

function f1(a, rng)
    b = a[:, :]
    A = Matrix{Complex64}(1, 1)
    B = Matrix{Complex64}(1, 1)
    k(A, a, b)
    # @code_llvm k(A, a, b)
    k(B, a, a)
    @show A
    @show B
    @show A - B
end

a = Complex64[0.4713259 + 0.14339028im]'' # Make this a matrix

The output is

A = Complex{Float32}[0.24270888 - 1.2801453e-9im]
B = Complex{Float32}[0.24270888 - 1.2801453e-9im]
A - B = Complex{Float32}[0.0 + 0.0im]

but it should be

A = Complex{Float32}[0.24270888 - 0.0im]
B = Complex{Float32}[0.24270888 - 0.0im]
A - B = Complex{Float32}[0.0 + 0.0im]

This doesn't happen for all input values and is not limited to 1x1 matrices either. (The original repro came from the Julia linalg/symmetric test with a 10x5 matrix.)

Unless the parameters we use in Julia to call this function are off, the problem should be in OpenBLAS, since I've verified in the debugger that the C function gets the right arguments.

The hardware I tested on has some issues with hardware watchpoints, so I wasn't able to pinpoint the exact line where the wrong value was generated (or at least stored), but it seems to happen (for 1x1 at least) somewhere in the cgemm_kernel_r function defined in kernel/generic/zgemmkernel_2x2.c. This also happens with the reference BLAS, which I assume uses the same or a similar kernel. Disassembly attached in case miscompilation is a worry (compiled with native gcc-5.3).

I understand this is a ULP-level error (for Float32 input), but it seems to happen for too many inputs and shouldn't be this bad at 1x1 size. It doesn't seem to be an issue on x86 (probably because a different kernel is used there).

Is there a way to compile this generic kernel on x64 for testing?

(gdb) disassemble cgemm_kernel_r
Dump of assembler code for function cgemm_kernel_r:
   0x0000007db0bbffd0 <+0>:     stp     x29, x30, [sp,#-144]!
   0x0000007db0bbffd4 <+4>:     fmov    s24, s1
   0x0000007db0bbffd8 <+8>:     mov     x29, sp
   0x0000007db0bbffdc <+12>:    stp     x23, x24, [sp,#48]
   0x0000007db0bbffe0 <+16>:    stp     d8, d9, [sp,#80]
   0x0000007db0bbffe4 <+20>:    add     x23, x1, x1, lsr #63
   0x0000007db0bbffe8 <+24>:    stp     d10, d11, [sp,#96]
   0x0000007db0bbffec <+28>:    stp     d12, d13, [sp,#112]
   0x0000007db0bbfff0 <+32>:    asr     x23, x23, #1
   0x0000007db0bbfff4 <+36>:    stp     d14, d15, [sp,#128]
   0x0000007db0bbfff8 <+40>:    stp     x19, x20, [sp,#16]
   0x0000007db0bbfffc <+44>:    stp     x21, x22, [sp,#32]
   0x0000007db0bc0000 <+48>:    stp     x25, x26, [sp,#64]
   0x0000007db0bc0004 <+52>:    cmp     x23, xzr
   0x0000007db0bc0008 <+56>:    mov     x19, x1
   0x0000007db0bc000c <+60>:    b.le    0x7db0bc03c8 <cgemm_kernel_r+1016>
   0x0000007db0bc0010 <+64>:    cmp     x2, xzr
   0x0000007db0bc0014 <+68>:    add     x11, x2, #0x3
   0x0000007db0bc0018 <+72>:    csel    x11, x11, x2, lt
   0x0000007db0bc001c <+76>:    add     x16, x0, x0, lsr #63
   0x0000007db0bc0020 <+80>:    asr     x11, x11, #2
   0x0000007db0bc0024 <+84>:    asr     x16, x16, #1
   0x0000007db0bc0028 <+88>:    and     x12, x2, #0x3
   0x0000007db0bc002c <+92>:    lsl     x24, x2, #4
   0x0000007db0bc0030 <+96>:    lsl     x21, x6, #4
   0x0000007db0bc0034 <+100>:   add     x30, x5, x6, lsl #3
   0x0000007db0bc0038 <+104>:   and     x25, x0, #0x1
   0x0000007db0bc003c <+108>:   mov     x1, x5
   0x0000007db0bc0040 <+112>:   lsl     x17, x11, #6
   0x0000007db0bc0044 <+116>:   lsl     x18, x12, #4
   0x0000007db0bc0048 <+120>:   mov     x15, x4
   0x0000007db0bc004c <+124>:   mov     x20, #0x0                       // #0
   0x0000007db0bc0050 <+128>:   lsl     x22, x16, #4
   0x0000007db0bc0054 <+132>:   cmp     x16, xzr
   0x0000007db0bc0058 <+136>:   mov     x10, x30
   0x0000007db0bc005c <+140>:   b.le    0x7db0bc0560 <cgemm_kernel_r+1424>
   0x0000007db0bc0060 <+144>:   add     x26, x15, x17
   0x0000007db0bc0064 <+148>:   mov     x13, x1
   0x0000007db0bc0068 <+152>:   mov     x8, x3
   0x0000007db0bc006c <+156>:   mov     x14, #0x0                       // #0
   0x0000007db0bc0070 <+160>:   fmov    s18, wzr
   0x0000007db0bc0074 <+164>:   cmp     x11, xzr
   0x0000007db0bc0078 <+168>:   b.le    0x7db0bc052c <cgemm_kernel_r+1372>
   0x0000007db0bc007c <+172>:   fmov    s7, s18
   0x0000007db0bc0080 <+176>:   mov     x7, x15
   0x0000007db0bc0084 <+180>:   fmov    s6, s18
   0x0000007db0bc0088 <+184>:   mov     x6, x8
   0x0000007db0bc008c <+188>:   fmov    s5, s18
   0x0000007db0bc0090 <+192>:   mov     x9, #0x0                        // #0
   0x0000007db0bc0094 <+196>:   fmov    s28, s18
   0x0000007db0bc0098 <+200>:   fmov    s4, s18
   0x0000007db0bc009c <+204>:   fmov    s3, s18
   0x0000007db0bc00a0 <+208>:   fmov    s2, s18
   0x0000007db0bc00a4 <+212>:   ldp     s26, s8, [x6]
   0x0000007db0bc00a8 <+216>:   ldr     s9, [x6,#28]
   0x0000007db0bc00ac <+220>:   ldp     s1, s25, [x6,#8]
   0x0000007db0bc00b0 <+224>:   ldr     s13, [x7,#16]
   0x0000007db0bc00b4 <+228>:   ldp     s20, s12, [x7]
   0x0000007db0bc00b8 <+232>:   ldr     s31, [x6,#40]
   0x0000007db0bc00bc <+236>:   ldp     s23, s16, [x7,#8]
   0x0000007db0bc00c0 <+240>:   ldr     s15, [x7,#32]
   0x0000007db0bc00c4 <+244>:   ldp     s21, s19, [x7,#24]
   0x0000007db0bc00c8 <+248>:   ldr     s14, [x7,#40]
   0x0000007db0bc00cc <+252>:   ldp     s10, s11, [x6,#16]
   0x0000007db0bc00d0 <+256>:   ldr     s27, [x6,#52]
   0x0000007db0bc00d4 <+260>:   fmadd   s17, s20, s25, s28
   0x0000007db0bc00d8 <+264>:   add     x9, x9, #0x1
   0x0000007db0bc00dc <+268>:   fmadd   s5, s26, s23, s5
   0x0000007db0bc00e0 <+272>:   cmp     x9, x11
   0x0000007db0bc00e4 <+276>:   fmadd   s7, s1, s23, s7
   0x0000007db0bc00e8 <+280>:   add     x7, x7, #0x40
   0x0000007db0bc00ec <+284>:   fmadd   s2, s26, s20, s2
   0x0000007db0bc00f0 <+288>:   fmadd   s3, s20, s8, s3
   0x0000007db0bc00f4 <+292>:   fmadd   s4, s20, s1, s4
   0x0000007db0bc00f8 <+296>:   fmadd   s6, s8, s23, s6
   0x0000007db0bc00fc <+300>:   fmadd   s23, s25, s23, s18
   0x0000007db0bc0100 <+304>:   fmsub   s20, s12, s1, s17
   0x0000007db0bc0104 <+308>:   ldr     s28, [x6,#24]
   0x0000007db0bc0108 <+312>:   ldur    s18, [x7,#-44]
   0x0000007db0bc010c <+316>:   fmadd   s2, s8, s12, s2
   0x0000007db0bc0110 <+320>:   fmsub   s3, s26, s12, s3
   0x0000007db0bc0114 <+324>:   fmadd   s4, s12, s25, s4
   0x0000007db0bc0118 <+328>:   fmadd   s8, s8, s16, s5
   0x0000007db0bc011c <+332>:   fmsub   s22, s1, s16, s23
   0x0000007db0bc0120 <+336>:   fmsub   s6, s26, s16, s6
   0x0000007db0bc0124 <+340>:   ldp     s17, s26, [x6,#44]
   0x0000007db0bc0128 <+344>:   fmadd   s25, s25, s16, s7
   0x0000007db0bc012c <+348>:   ldp     s16, s5, [x6,#32]
   0x0000007db0bc0130 <+352>:   ldur    s12, [x7,#-28]
   0x0000007db0bc0134 <+356>:   add     x6, x6, #0x40
   0x0000007db0bc0138 <+360>:   fmadd   s2, s10, s13, s2
   0x0000007db0bc013c <+364>:   fmadd   s3, s13, s11, s3
   0x0000007db0bc0140 <+368>:   fmadd   s4, s13, s28, s4
   0x0000007db0bc0144 <+372>:   fmadd   s1, s10, s21, s8
   0x0000007db0bc0148 <+376>:   fmadd   s22, s9, s21, s22
   0x0000007db0bc014c <+380>:   fmadd   s6, s11, s21, s6
   0x0000007db0bc0150 <+384>:   fmadd   s7, s28, s21, s25
   0x0000007db0bc0154 <+388>:   fmadd   s13, s13, s9, s20
   0x0000007db0bc0158 <+392>:   fmadd   s2, s11, s18, s2
   0x0000007db0bc015c <+396>:   fmsub   s3, s10, s18, s3
   0x0000007db0bc0160 <+400>:   fmadd   s4, s18, s9, s4
   0x0000007db0bc0164 <+404>:   fmadd   s11, s11, s19, s1
   0x0000007db0bc0168 <+408>:   fmsub   s22, s28, s19, s22
   0x0000007db0bc016c <+412>:   fmsub   s6, s10, s19, s6
   0x0000007db0bc0170 <+416>:   fmadd   s7, s9, s19, s7
   0x0000007db0bc0174 <+420>:   fmsub   s20, s18, s28, s13
   0x0000007db0bc0178 <+424>:   fmadd   s2, s16, s15, s2
   0x0000007db0bc017c <+428>:   fmadd   s3, s15, s5, s3
   0x0000007db0bc0180 <+432>:   fmadd   s4, s15, s31, s4
   0x0000007db0bc0184 <+436>:   fmadd   s11, s16, s14, s11
   0x0000007db0bc0188 <+440>:   fmadd   s21, s17, s14, s22
   0x0000007db0bc018c <+444>:   fmadd   s6, s5, s14, s6
   0x0000007db0bc0190 <+448>:   fmadd   s7, s31, s14, s7
   0x0000007db0bc0194 <+452>:   fmadd   s20, s15, s17, s20
   0x0000007db0bc0198 <+456>:   ldur    s8, [x7,#-20]
   0x0000007db0bc019c <+460>:   fmadd   s2, s5, s12, s2
   0x0000007db0bc01a0 <+464>:   fmadd   s4, s12, s17, s4
   0x0000007db0bc01a4 <+468>:   fmsub   s3, s16, s12, s3
   0x0000007db0bc01a8 <+472>:   fmadd   s5, s5, s8, s11
   0x0000007db0bc01ac <+476>:   fmsub   s6, s16, s8, s6
   0x0000007db0bc01b0 <+480>:   fmadd   s7, s17, s8, s7
   0x0000007db0bc01b4 <+484>:   fmsub   s20, s12, s31, s20
   0x0000007db0bc01b8 <+488>:   fmsub   s19, s31, s8, s21
   0x0000007db0bc01bc <+492>:   ldur    s30, [x7,#-16]
   0x0000007db0bc01c0 <+496>:   ldur    s29, [x7,#-8]
   0x0000007db0bc01c4 <+500>:   ldur    s23, [x6,#-8]
   0x0000007db0bc01c8 <+504>:   ldur    s25, [x6,#-4]
   0x0000007db0bc01cc <+508>:   fmadd   s2, s26, s30, s2
   0x0000007db0bc01d0 <+512>:   fmadd   s3, s30, s27, s3
   0x0000007db0bc01d4 <+516>:   fmadd   s4, s30, s23, s4
   0x0000007db0bc01d8 <+520>:   fmadd   s17, s30, s25, s20
   0x0000007db0bc01dc <+524>:   fmadd   s5, s26, s29, s5
   0x0000007db0bc01e0 <+528>:   fmadd   s6, s27, s29, s6
   0x0000007db0bc01e4 <+532>:   fmadd   s7, s23, s29, s7
   0x0000007db0bc01e8 <+536>:   fmadd   s18, s25, s29, s19
   0x0000007db0bc01ec <+540>:   ldur    s28, [x7,#-12]
   0x0000007db0bc01f0 <+544>:   ldur    s1, [x7,#-4]
   0x0000007db0bc01f4 <+548>:   fmadd   s2, s27, s28, s2
   0x0000007db0bc01f8 <+552>:   fmsub   s3, s26, s28, s3
   0x0000007db0bc01fc <+556>:   fmadd   s4, s28, s25, s4
   0x0000007db0bc0200 <+560>:   fmadd   s5, s27, s1, s5
   0x0000007db0bc0204 <+564>:   fmsub   s28, s28, s23, s17
   0x0000007db0bc0208 <+568>:   fmsub   s6, s26, s1, s6
   0x0000007db0bc020c <+572>:   fmadd   s7, s25, s1, s7
   0x0000007db0bc0210 <+576>:   fmsub   s18, s23, s1, s18
   0x0000007db0bc0214 <+580>:   b.ne    0x7db0bc00a4 <cgemm_kernel_r+212>
   0x0000007db0bc0218 <+584>:   add     x8, x8, x17
   0x0000007db0bc021c <+588>:   mov     x6, x26
   0x0000007db0bc0220 <+592>:   cbz     x12, 0x7db0bc0294 <cgemm_kernel_r+708>
   0x0000007db0bc0224 <+596>:   mov     x7, x8
   0x0000007db0bc0228 <+600>:   mov     x9, #0x0                        // #0
   0x0000007db0bc022c <+604>:   ldp     s17, s20, [x6]
   0x0000007db0bc0230 <+608>:   add     x9, x9, #0x1
   0x0000007db0bc0234 <+612>:   ldp     s16, s23, [x6,#8]
   0x0000007db0bc0238 <+616>:   cmp     x9, x12
   0x0000007db0bc023c <+620>:   ldp     s21, s22, [x7]
   0x0000007db0bc0240 <+624>:   add     x6, x6, #0x10
   0x0000007db0bc0244 <+628>:   ldp     s1, s19, [x7,#8]
   0x0000007db0bc0248 <+632>:   add     x7, x7, #0x10
   0x0000007db0bc024c <+636>:   fmadd   s2, s21, s17, s2
   0x0000007db0bc0250 <+640>:   fmadd   s3, s17, s22, s3
   0x0000007db0bc0254 <+644>:   fmadd   s4, s17, s1, s4
   0x0000007db0bc0258 <+648>:   fmadd   s5, s21, s16, s5
   0x0000007db0bc025c <+652>:   fmadd   s6, s22, s16, s6
   0x0000007db0bc0260 <+656>:   fmadd   s7, s1, s16, s7
   0x0000007db0bc0264 <+660>:   fmadd   s17, s17, s19, s28
   0x0000007db0bc0268 <+664>:   fmadd   s16, s19, s16, s18
   0x0000007db0bc026c <+668>:   fmadd   s2, s22, s20, s2
   0x0000007db0bc0270 <+672>:   fmsub   s3, s21, s20, s3
   0x0000007db0bc0274 <+676>:   fmadd   s4, s20, s19, s4
   0x0000007db0bc0278 <+680>:   fmadd   s5, s22, s23, s5
   0x0000007db0bc027c <+684>:   fmsub   s6, s21, s23, s6
   0x0000007db0bc0280 <+688>:   fmadd   s7, s19, s23, s7
   0x0000007db0bc0284 <+692>:   fmsub   s28, s20, s1, s17
   0x0000007db0bc0288 <+696>:   fmsub   s18, s1, s23, s16
   0x0000007db0bc028c <+700>:   b.ne    0x7db0bc022c <cgemm_kernel_r+604>
   0x0000007db0bc0290 <+704>:   add     x8, x8, x18
   0x0000007db0bc0294 <+708>:   ldp     s9, s8, [x13]
   0x0000007db0bc0298 <+712>:   add     x10, x10, #0x10
   0x0000007db0bc029c <+716>:   ldp     s17, s1, [x13,#8]
   0x0000007db0bc02a0 <+720>:   add     x13, x13, #0x10
   0x0000007db0bc02a4 <+724>:   add     x14, x14, #0x1
   0x0000007db0bc02a8 <+728>:   cmp     x14, x16
   0x0000007db0bc02ac <+732>:   fmadd   s8, s0, s3, s8
   0x0000007db0bc02b0 <+736>:   fmadd   s9, s0, s2, s9
   0x0000007db0bc02b4 <+740>:   fmadd   s1, s0, s28, s1
   0x0000007db0bc02b8 <+744>:   fmadd   s17, s0, s4, s17
   0x0000007db0bc02bc <+748>:   fmadd   s2, s24, s2, s8
   0x0000007db0bc02c0 <+752>:   fmsub   s3, s24, s3, s9
   0x0000007db0bc02c4 <+756>:   fmadd   s4, s24, s4, s1
   0x0000007db0bc02c8 <+760>:   fmsub   s17, s24, s28, s17
   0x0000007db0bc02cc <+764>:   stp     s3, s2, [x13,#-16]
   0x0000007db0bc02d0 <+768>:   stp     s17, s4, [x13,#-8]
   0x0000007db0bc02d4 <+772>:   ldp     s3, s1, [x10,#-16]
   0x0000007db0bc02d8 <+776>:   ldp     s23, s2, [x10,#-8]
   0x0000007db0bc02dc <+780>:   fmadd   s1, s0, s6, s1
   0x0000007db0bc02e0 <+784>:   fmadd   s3, s0, s5, s3
   0x0000007db0bc02e4 <+788>:   fmadd   s23, s0, s7, s23
   0x0000007db0bc02e8 <+792>:   fmadd   s2, s0, s18, s2
   0x0000007db0bc02ec <+796>:   fmadd   s5, s24, s5, s1
   0x0000007db0bc02f0 <+800>:   fmsub   s6, s24, s6, s3
   0x0000007db0bc02f4 <+804>:   fmsub   s23, s24, s18, s23
   0x0000007db0bc02f8 <+808>:   fmadd   s7, s24, s7, s2
   0x0000007db0bc02fc <+812>:   stp     s6, s5, [x10,#-16]
   0x0000007db0bc0300 <+816>:   stp     s23, s7, [x10,#-8]
   0x0000007db0bc0304 <+820>:   b.ne    0x7db0bc0070 <cgemm_kernel_r+160>
   0x0000007db0bc0308 <+824>:   add     x9, x1, x22
   0x0000007db0bc030c <+828>:   add     x10, x22, x30
   0x0000007db0bc0310 <+832>:   cbz     x25, 0x7db0bc03a8 <cgemm_kernel_r+984>
   0x0000007db0bc0314 <+836>:   fmov    s3, wzr
   0x0000007db0bc0318 <+840>:   cmp     x2, xzr
   0x0000007db0bc031c <+844>:   b.le    0x7db0bc056c <cgemm_kernel_r+1436>
   0x0000007db0bc0320 <+848>:   fmov    s6, s3
   0x0000007db0bc0324 <+852>:   mov     x6, x15
   0x0000007db0bc0328 <+856>:   fmov    s5, s3
   0x0000007db0bc032c <+860>:   mov     x7, #0x0                        // #0
   0x0000007db0bc0330 <+864>:   fmov    s4, s3
   0x0000007db0bc0334 <+868>:   ldp     s1, s2, [x8]
   0x0000007db0bc0338 <+872>:   ldr     s7, [x6,#8]
   0x0000007db0bc033c <+876>:   ldp     s9, s8, [x6]
   0x0000007db0bc0340 <+880>:   add     x7, x7, #0x1
   0x0000007db0bc0344 <+884>:   cmp     x2, x7
   0x0000007db0bc0348 <+888>:   add     x8, x8, #0x8
   0x0000007db0bc034c <+892>:   add     x6, x6, #0x10
   0x0000007db0bc0350 <+896>:   fmadd   s6, s1, s7, s6
   0x0000007db0bc0354 <+900>:   fmadd   s3, s2, s7, s3
   0x0000007db0bc0358 <+904>:   fmadd   s4, s1, s9, s4
   0x0000007db0bc035c <+908>:   fmadd   s5, s9, s2, s5
   0x0000007db0bc0360 <+912>:   ldur    s7, [x6,#-4]
   0x0000007db0bc0364 <+916>:   fmadd   s6, s2, s7, s6
   0x0000007db0bc0368 <+920>:   fmsub   s3, s1, s7, s3
   0x0000007db0bc036c <+924>:   fmadd   s4, s2, s8, s4
   0x0000007db0bc0370 <+928>:   fmsub   s5, s1, s8, s5
   0x0000007db0bc0374 <+932>:   b.ne    0x7db0bc0334 <cgemm_kernel_r+868>
   0x0000007db0bc0378 <+936>:   ldp     s1, s2, [x9]
   0x0000007db0bc037c <+940>:   fmadd   s2, s0, s5, s2
   0x0000007db0bc0380 <+944>:   fmadd   s1, s0, s4, s1
   0x0000007db0bc0384 <+948>:   fmadd   s4, s24, s4, s2
   0x0000007db0bc0388 <+952>:   fmsub   s5, s24, s5, s1
   0x0000007db0bc038c <+956>:   stp     s5, s4, [x9]
   0x0000007db0bc0390 <+960>:   ldp     s1, s2, [x10]
   0x0000007db0bc0394 <+964>:   fmadd   s2, s0, s3, s2
   0x0000007db0bc0398 <+968>:   fmadd   s1, s0, s6, s1
   0x0000007db0bc039c <+972>:   fmadd   s6, s24, s6, s2
   0x0000007db0bc03a0 <+976>:   fmsub   s3, s24, s3, s1
   0x0000007db0bc03a4 <+980>:   stp     s3, s6, [x10]
   0x0000007db0bc03a8 <+984>:   add     x20, x20, #0x1
   0x0000007db0bc03ac <+988>:   add     x15, x15, x24
   0x0000007db0bc03b0 <+992>:   cmp     x20, x23
   0x0000007db0bc03b4 <+996>:   add     x1, x1, x21
   0x0000007db0bc03b8 <+1000>:  add     x30, x30, x21
   0x0000007db0bc03bc <+1004>:  b.ne    0x7db0bc0054 <cgemm_kernel_r+132>
   0x0000007db0bc03c0 <+1008>:  madd    x4, x20, x24, x4
   0x0000007db0bc03c4 <+1012>:  madd    x5, x20, x21, x5
   0x0000007db0bc03c8 <+1016>:  tbz     w19, #0, 0x7db0bc0500 <cgemm_kernel_r+1328>
   0x0000007db0bc03cc <+1020>:  add     x9, x0, x0, lsr #63
   0x0000007db0bc03d0 <+1024>:  and     x11, x0, #0x1
   0x0000007db0bc03d4 <+1028>:  asr     x9, x9, #1
   0x0000007db0bc03d8 <+1032>:  lsl     x10, x2, #4
   0x0000007db0bc03dc <+1036>:  cmp     x9, xzr
   0x0000007db0bc03e0 <+1040>:  mov     x7, x5
   0x0000007db0bc03e4 <+1044>:  mov     x8, #0x0                        // #0
   0x0000007db0bc03e8 <+1048>:  lsl     x12, x9, #4
   0x0000007db0bc03ec <+1052>:  b.le    0x7db0bc04a0 <cgemm_kernel_r+1232>
   0x0000007db0bc03f0 <+1056>:  fmov    s1, wzr
   0x0000007db0bc03f4 <+1060>:  cmp     x2, xzr
   0x0000007db0bc03f8 <+1064>:  b.le    0x7db0bc0550 <cgemm_kernel_r+1408>
   0x0000007db0bc03fc <+1068>:  fmov    s5, s1
   0x0000007db0bc0400 <+1072>:  mov     x1, x4
   0x0000007db0bc0404 <+1076>:  fmov    s4, s1
   0x0000007db0bc0408 <+1080>:  mov     x0, x3
   0x0000007db0bc040c <+1084>:  fmov    s3, s1
   0x0000007db0bc0410 <+1088>:  mov     x6, #0x0                        // #0
   0x0000007db0bc0414 <+1092>:  ldp     s8, s9, [x0]
   0x0000007db0bc0418 <+1096>:  ldr     s2, [x1]
   0x0000007db0bc041c <+1100>:  ldp     s6, s7, [x0,#8]
   0x0000007db0bc0420 <+1104>:  add     x6, x6, #0x1
   0x0000007db0bc0424 <+1108>:  cmp     x2, x6
   0x0000007db0bc0428 <+1112>:  add     x0, x0, #0x10
   0x0000007db0bc042c <+1116>:  add     x1, x1, #0x8
   0x0000007db0bc0430 <+1120>:  fmadd   s3, s8, s2, s3
   0x0000007db0bc0434 <+1124>:  fmadd   s4, s2, s9, s4
   0x0000007db0bc0438 <+1128>:  fmadd   s5, s2, s6, s5
   0x0000007db0bc043c <+1132>:  fmadd   s2, s2, s7, s1
   0x0000007db0bc0440 <+1136>:  ldur    s1, [x1,#-4]
   0x0000007db0bc0444 <+1140>:  fmadd   s3, s9, s1, s3
   0x0000007db0bc0448 <+1144>:  fmsub   s4, s8, s1, s4
   0x0000007db0bc044c <+1148>:  fmadd   s5, s1, s7, s5
   0x0000007db0bc0450 <+1152>:  fmsub   s1, s1, s6, s2
   0x0000007db0bc0454 <+1156>:  b.ne    0x7db0bc0414 <cgemm_kernel_r+1092>
   0x0000007db0bc0458 <+1160>:  add     x3, x3, x10
   0x0000007db0bc045c <+1164>:  ldp     s8, s7, [x7]
   0x0000007db0bc0460 <+1168>:  add     x8, x8, #0x1
   0x0000007db0bc0464 <+1172>:  ldp     s2, s6, [x7,#8]
   0x0000007db0bc0468 <+1176>:  add     x7, x7, #0x10
   0x0000007db0bc046c <+1180>:  cmp     x8, x9
   0x0000007db0bc0470 <+1184>:  fmadd   s7, s0, s4, s7
   0x0000007db0bc0474 <+1188>:  fmadd   s8, s0, s3, s8
   0x0000007db0bc0478 <+1192>:  fmadd   s6, s0, s1, s6
   0x0000007db0bc047c <+1196>:  fmadd   s2, s0, s5, s2
   0x0000007db0bc0480 <+1200>:  fmadd   s3, s24, s3, s7
   0x0000007db0bc0484 <+1204>:  fmsub   s4, s24, s4, s8
   0x0000007db0bc0488 <+1208>:  fmadd   s5, s24, s5, s6
   0x0000007db0bc048c <+1212>:  fmsub   s1, s24, s1, s2
   0x0000007db0bc0490 <+1216>:  stp     s4, s3, [x7,#-16]
   0x0000007db0bc0494 <+1220>:  stp     s1, s5, [x7,#-8]
   0x0000007db0bc0498 <+1224>:  b.ne    0x7db0bc03f0 <cgemm_kernel_r+1056>
   0x0000007db0bc049c <+1228>:  add     x5, x5, x12
   0x0000007db0bc04a0 <+1232>:  cbz     x11, 0x7db0bc0500 <cgemm_kernel_r+1328>
   0x0000007db0bc04a4 <+1236>:  fmov    s1, wzr
   0x0000007db0bc04a8 <+1240>:  cmp     x2, xzr
   0x0000007db0bc04ac <+1244>:  fmov    s2, s1
   0x0000007db0bc04b0 <+1248>:  b.le    0x7db0bc04e8 <cgemm_kernel_r+1304>
   0x0000007db0bc04b4 <+1252>:  fmov    s2, s1
   0x0000007db0bc04b8 <+1256>:  mov     x0, #0x0                        // #0
   0x0000007db0bc04bc <+1260>:  ldp     s4, s5, [x3]
   0x0000007db0bc04c0 <+1264>:  add     x0, x0, #0x1
   0x0000007db0bc04c4 <+1268>:  ldp     s6, s3, [x4]
   0x0000007db0bc04c8 <+1272>:  cmp     x2, x0
   0x0000007db0bc04cc <+1276>:  add     x3, x3, #0x8
   0x0000007db0bc04d0 <+1280>:  add     x4, x4, #0x8
   0x0000007db0bc04d4 <+1284>:  fmadd   s2, s4, s6, s2
   0x0000007db0bc04d8 <+1288>:  fmadd   s1, s6, s5, s1
   0x0000007db0bc04dc <+1292>:  fmadd   s2, s5, s3, s2
   0x0000007db0bc04e0 <+1296>:  fmsub   s1, s4, s3, s1
   0x0000007db0bc04e4 <+1300>:  b.ne    0x7db0bc04bc <cgemm_kernel_r+1260>
   0x0000007db0bc04e8 <+1304>:  ldp     s3, s4, [x5]
   0x0000007db0bc04ec <+1308>:  fmadd   s4, s0, s1, s4
   0x0000007db0bc04f0 <+1312>:  fmadd   s0, s0, s2, s3
   0x0000007db0bc04f4 <+1316>:  fmadd   s2, s24, s2, s4
   0x0000007db0bc04f8 <+1320>:  fmsub   s1, s24, s1, s0
   0x0000007db0bc04fc <+1324>:  stp     s1, s2, [x5]
   0x0000007db0bc0500 <+1328>:  mov     w0, #0x0                        // #0
   0x0000007db0bc0504 <+1332>:  ldp     d8, d9, [sp,#80]
   0x0000007db0bc0508 <+1336>:  ldp     x19, x20, [sp,#16]
   0x0000007db0bc050c <+1340>:  ldp     d10, d11, [sp,#96]
   0x0000007db0bc0510 <+1344>:  ldp     x21, x22, [sp,#32]
   0x0000007db0bc0514 <+1348>:  ldp     d12, d13, [sp,#112]
   0x0000007db0bc0518 <+1352>:  ldp     x23, x24, [sp,#48]
   0x0000007db0bc051c <+1356>:  ldp     d14, d15, [sp,#128]
   0x0000007db0bc0520 <+1360>:  ldp     x25, x26, [sp,#64]
   0x0000007db0bc0524 <+1364>:  ldp     x29, x30, [sp],#144
   0x0000007db0bc0528 <+1368>:  ret
   0x0000007db0bc052c <+1372>:  fmov    s7, s18
   0x0000007db0bc0530 <+1376>:  mov     x6, x15
   0x0000007db0bc0534 <+1380>:  fmov    s6, s18
   0x0000007db0bc0538 <+1384>:  fmov    s5, s18
   0x0000007db0bc053c <+1388>:  fmov    s28, s18
   0x0000007db0bc0540 <+1392>:  fmov    s4, s18
   0x0000007db0bc0544 <+1396>:  fmov    s3, s18
   0x0000007db0bc0548 <+1400>:  fmov    s2, s18
   0x0000007db0bc054c <+1404>:  b       0x7db0bc0220 <cgemm_kernel_r+592>
   0x0000007db0bc0550 <+1408>:  fmov    s5, s1
   0x0000007db0bc0554 <+1412>:  fmov    s4, s1
   0x0000007db0bc0558 <+1416>:  fmov    s3, s1
   0x0000007db0bc055c <+1420>:  b       0x7db0bc045c <cgemm_kernel_r+1164>
   0x0000007db0bc0560 <+1424>:  mov     x8, x3
   0x0000007db0bc0564 <+1428>:  mov     x9, x1
   0x0000007db0bc0568 <+1432>:  b       0x7db0bc0310 <cgemm_kernel_r+832>
   0x0000007db0bc056c <+1436>:  fmov    s6, s3
   0x0000007db0bc0570 <+1440>:  fmov    s5, s3
   0x0000007db0bc0574 <+1444>:  fmov    s4, s3
   0x0000007db0bc0578 <+1448>:  b       0x7db0bc0378 <cgemm_kernel_r+936>
End of assembler dump.
@yuyichao
Contributor Author

yuyichao commented Jan 4, 2016

OK, here is a C repro:

#include <cblas.h>
#include <stdio.h>

int main()
{
    const float A[2] = {0.4713259, 0.14339028};
    const float B[2] = {0.4713259, 0.14339028};
    float C[2];
    const float alpha[2] = {1, 0};
    const float beta[2] = {0, 0};
    cblas_cgemm(CblasRowMajor, CblasNoTrans, CblasConjTrans,
                1, 1, 1, alpha, A, 1, B, 1, beta, C, 1);
    printf("%g, %g\n", C[0], C[1]);
    cblas_cgemm(CblasColMajor, CblasNoTrans, CblasConjTrans,
                1, 1, 1, alpha, A, 1, B, 1, beta, C, 1);
    printf("%g, %g\n", C[0], C[1]);
    return 0;
}

The output on AArch64 is

0.242709, 1.28015e-09
0.242709, -1.28015e-09

and on both x64 and AArch32 it's

0.242709, 0
0.242709, 0

@yuyichao
Contributor Author

yuyichao commented Jan 4, 2016

The relevant assembly generated is

ldp     s4, s5, [x3]
ldp     s6, s3, [x4]
fmadd   s2, s4, s6, s2
fmsub   s1, s6, s5, s1
fmadd   s2, s5, s3, s2
fmadd   s1, s4, s3, s1

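Here x3 and x4 are the addresses of the two complex numbers (so s4 and s6 are the real parts, and s5 and s3 are the imaginary parts). This looks logically valid, but the issue is that fma(s5, s6, -s5 * s6) is not zero: the FMA keeps its product exact, while the separately computed s5 * s6 has already been rounded to Float32.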

julia> s5 = 0.14339028f0
0.14339028f0

julia> s6 = 0.471325904f0
0.4713259f0

julia> fma(s5, s6, -s5 * s6)
1.2801453f-9

I guess the solution could be to stop passing whichever flag allows the compiler to emit fma instructions? (Or maybe use manually optimized kernels that don't have this problem.)

@buffer51
Contributor

buffer51 commented Feb 5, 2016

Hi @yuyichao,

Which processor are you testing on? How did you build OpenBLAS?

AArch64 has had several patches for incorrect numeric results in the GNU tools; it could be that you're not enabling the required patches. For instance, the Android cross-compiler didn't have them as of a few months ago.

@yuyichao
Contributor Author

yuyichao commented Feb 5, 2016

The system is ArchLinux ARM. The PKGBUILD I used to compile openblas can be found here.

The hardware is Jetson TX1.

@xianyi
Collaborator

xianyi commented Feb 11, 2016

I cannot control which instructions the compiler emits for the generic kernels. Did you try the latest develop branch? I think it should use the assembly kernel for Cortex-A57.

@yuyichao
Contributor Author

Do I need to enable the cortex-a57 kernel manually? With the develop branch the problem is still there.

@xianyi
Collaborator

xianyi commented Feb 12, 2016

What's the name of the library? Is it libopenblas_armv8 or libopenblas_cortexa57? If it is armv8, you need to build with make TARGET=CORTEXA57.

@yuyichao
Contributor Author

It's armv8. It seems the README is out of date. I'll test passing the appropriate make flag later.

@yuyichao
Contributor Author

Not really, both the original test (5x5 matrix) and the reduced test (1x1 matrix) are still failing.

@xianyi
Collaborator

xianyi commented Feb 23, 2016

In the Cortex-A57 kernel, OpenBLAS already uses fmla and fmls to compute the result, as follows.

   0x40d724 <cgemm_kernel_L1_M1_42+16>: fmla    s16, s0, v8.s[0]
   0x40d728 <cgemm_kernel_L1_M1_42+20>: fmla    s16, s1, v9.s[0]
   0x40d72c <cgemm_kernel_L1_M1_42+24>: fmla    s17, s0, v9.s[0]
   0x40d730 <cgemm_kernel_L1_M1_42+28>: fmls    s17, s1, v8.s[0]

s16 holds the real part of the result and s17 holds the imaginary part.

Before fmls,

p $s17
$15 = {f = 0.0675835535, u = 1032481087, s = 1032481087}

After fmls,

p $s17
$17 = {f = -1.28014532e-09, u = 2964320540, s = -1330646756}

I have no idea how to fix it. Is this a hardware limitation?

@xianyi
Collaborator

xianyi commented Mar 1, 2016

According to feedback from the ARM engineer, the result -1.28014532e-09 is correct and smaller than EPS.
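Under that reading, a test that compares the imaginary part against an exact zero is too strict. Here is a hedged C sketch of a tolerance check. `imag_within_eps` is a hypothetical helper, and scaling `FLT_EPSILON` by the size of the subtracted product terms is one common choice, not something this thread prescribes.

```c
#include <float.h>
#include <math.h>
#include <stdbool.h>

/* Returns true when the computed imaginary part of a * conj(b) differs from
 * the expected value by no more than FLT_EPSILON times the combined
 * magnitude of the two products being subtracted (ai*br and ar*bi),
 * i.e. when the difference is at the level of rounding error. */
bool imag_within_eps(float computed, float expected,
                     float ar, float ai, float br, float bi)
{
    float scale = fabsf(ai * br) + fabsf(ar * bi);
    return fabsf(computed - expected) <= FLT_EPSILON * scale;
}
```

For the 1x1 case in this issue, the reported -1.28014532e-09 passes such a check against an expected value of 0, while a genuinely wrong value such as 1e-3 would not.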
