Crash in ReLAPACK #2066

Jellby · 2019-03-21T08:45:32Z

I tried to use OpenBLAS with ReLAPACK in OpenMolcas (https://gitlab.com/Molcas/OpenMolcas) and got a crash, apparently in RELAPACK_dgetrf_rec at dgetrf.c.

I compiled OpenBLAS with:

make USE_OPENMP=1 INTERFACE64=1 NO_CBLAS=1 BUILD_RELAPACK=1 LIBPREFIX=libopenblas_i8

and the crash occurs, for example, when I run:

pymolcas verify 036

Without ReLAPACK, I don't see the problem.

What other information can I provide or how can I debug it further?

martin-frbg · 2019-03-21T09:01:28Z

You could try building OpenBLAS with debug information (setting DEBUG=1, or adding -g to the compile flags) to get more detailed line information. (I assume your "crash" is a segmentation fault, or is it just that the calculation goes haywire?) Which version of OpenBLAS did you try? (Not much has happened within ReLAPACK recently, but the problem could have started elsewhere in the code.)

Jellby · 2019-03-21T09:48:40Z

This is with v0.3.5. The crash is indeed a segmentation fault. With DEBUG=1 I get this stack trace:

symbolized stack trace:
    #0 (?) /Programs/OpenBLAS/libopenblas_i8_nehalemp-r0.3.5.so 0x3ab4d9
    #1 (?) /Programs/OpenBLAS/libopenblas_i8_nehalemp-r0.3.5.so 0x8ca98
    #2 RELAPACK_dgetrf_rec at /Programs/OpenBLAS/relapack/src/dgetrf.c:67
    #3 RELAPACK_dgetrf_rec at /Programs/OpenBLAS/relapack/src/dgetrf.c:98 (discriminator 3)
    #4 RELAPACK_dgetrf_rec at /Programs/OpenBLAS/relapack/src/dgetrf.c:98 (discriminator 3)
    #5 RELAPACK_dgetrf_rec at /Programs/OpenBLAS/relapack/src/dgetrf.c:98 (discriminator 3)
    #6 RELAPACK_dgetrf_rec at /Programs/OpenBLAS/relapack/src/dgetrf.c:98 (discriminator 3)
    #7 RELAPACK_dgetrf at /Programs/OpenBLAS/relapack/src/dgetrf.c:35
    #8 dgetrf_ at /Programs/OpenBLAS/relapack/src/lapack_wrappers.c:373
    #9 xdr_dmatinv_ at /Programs/OpenMolcas/src/dkh_util/xdr_dmatinv.f:28
    #10 xdr_fpfw_ at /Programs/OpenMolcas/src/dkh_util/xdr_fpfw.f:97
    #11 dkh_ts1e_ at /Programs/OpenMolcas/src/dkh_util/dkh_ts1e.f:45
    #12 xdr_ham_ at /Programs/OpenMolcas/src/dkh_util/xdr_ham.f:131
    #13 dkrelint_dp_ at /Programs/OpenMolcas/src/dkh_util/dkrelint_dp.f:366
    #14 drv1el_ at /Programs/OpenMolcas/src/seward/drv1el.f:1894
    #15 seward_ at /Programs/OpenMolcas/src/seward/seward.f:317
    #16 MAIN__ at /Programs/OpenMolcas/src/seward/main.f:23
    #17 __libc_start_main at /build/eglibc-ripdx6/eglibc-2.19/csu/libc-start.c:287
    #18 (?) /Programs/OpenMolcas/bounds/bin/seward.exe 0x405f51

ETA: with CBLAS, the first two lines are:

    #0 dgetf2_k at /Programs/OpenBLAS/lapack/getf2/getf2_k.c:83
    #1 dgetf2_ at /Programs/OpenBLAS/interface/lapack/getf2.c:96

martin-frbg · 2019-03-21T10:07:01Z

Not sure what to make of this: dgetrf.c line 67 is where it forwards the call to stock LAPACK dgetf2 on the assumption that the problem size is too small to make the recursive block approach worthwhile. I believe dgetf2 would complain if n actually managed to become zero or negative, but this may need confirmation.
Thanks for the update. From the git history it appears getf2_k already received a patch for some out-of-bounds mischief in #723.
From the backtrace it looks like you are running this on Windows? (Not that it matters except for ease of debugging.)

Jellby · 2019-03-21T10:26:20Z

Nope, not Windows, it's Ubuntu 14.04.

martin-frbg · 2019-03-21T10:50:09Z

At first glance the kludge from #723 should keep it from doing any accesses beyond the end of the array at line 83, as long as the jp value from line 97 remains positive. If it did not, it should have crashed at line 100, which has basically the same assignment. Could you try to obtain the values in ipiv at the time of failure by running your program from gdb?

Jellby · 2019-03-21T13:20:20Z

I don't know if I'm doing it right, but with gdb I get:

Program received signal SIGSEGV, Segmentation fault.
0x00007fffe1d2aa6d in dgetf2_k (args=0x7fffffff38f0, range_m=0x0, range_n=0x0, sa=0x7fffdca08020, sb=0x7fffdcb04020, myid=0) at getf2_k.c:82
82              temp1 = *(b + i);
(gdb) p args[0]   
$1 = {a = 0x60360000fd80, b = 0x0, c = 0x601000004fa0, d = 0x7fffe4433898, alpha = 0x10007fff6774, beta = 0x7fffffff3a70, m = 8, n = 140733193388040, k = 140737488323920, lda = 8, ldb = 105785044565376, 
  ldc = 140737488304872, ldd = 140737488323920, common = 0x601000004fa0, nthreads = 140737488305088}
(gdb) x/16 ipiv
0x601000004fa0: 0x00000008      0x00000000      0x00000007      0x00000000
0x601000004fb0: 0x00000006      0x00000000      0x00000005      0x00000000
0x601000004fc0: 0x00000005      0x00000000      0x00000006      0x00000000
0x601000004fd0: 0x00000007      0x00000000      0x00000008      0x00000000

I guess ipiv looks right, but n doesn't.
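
As a cross-check of the dump above, here is a minimal sketch that reinterprets the sixteen 32-bit words printed by `x/16 ipiv` as eight 64-bit integers, which is how an INTERFACE64=1 build would be expected to read them. It assumes a little-endian x86-64 host, and the plausibility test uses the usual LAPACK convention that entry i (1-based) of ipiv lies between i and m. The variable names are illustrative only and do not come from OpenBLAS.

```python
import struct

# 32-bit words exactly as printed by `x/16 ipiv` in the gdb session above
words = [0x8, 0x0, 0x7, 0x0, 0x6, 0x0, 0x5, 0x0,
         0x5, 0x0, 0x6, 0x0, 0x7, 0x0, 0x8, 0x0]

# Rebuild the raw bytes (little-endian assumed) and read them back
# as eight signed 64-bit integers
raw = struct.pack("<16I", *words)
ipiv = list(struct.unpack("<8q", raw))
print(ipiv)

# LAPACK getrf/getf2 convention: i <= ipiv(i) <= m with 1-based i
m = 8
plausible = all(i + 1 <= p <= m for i, p in enumerate(ipiv))
print(plausible)
```

Read this way, every second word is the zero upper half of a 64-bit pivot index, so the zeros in the dump are expected rather than a sign of corruption.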

martin-frbg · 2019-03-21T14:12:28Z

Yes, n looks quite unhealthy. Wonder where this happened, perhaps this is still visible when you go up the call tree in gdb and print the args at each level...

Jellby · 2019-03-21T14:51:18Z

Does this help?

Program received signal SIGSEGV, Segmentation fault.
0x00007fffe1d2aa6d in dgetf2_k (args=0x7fffffff3910, range_m=0x0, range_n=0x0, sa=0x7fff9d596020, sb=0x7fff9d692020, myid=0) at getf2_k.c:82
82              temp1 = *(b + i);
(gdb) p args.n
$1 = 140733193388040
(gdb) up
#1  0x00007fffe18710cd in dgetf2_ (M=0x7fffffff8570, N=0x7fffffff3b08, a=0x60360000a580, ldA=0x7fffffff8570, ipiv=0x601000005aa0, Info=0x7fffffff3be0) at lapack/getf2.c:96
96        info = GETF2(&args, NULL, NULL, sa, sb, 0);
(gdb) p *N
$2 = 140733193388040
(gdb) up
#2  0x00007fffe1870482 in RELAPACK_dgetrf_rec (m=0x7fffffff8570, n=0x7fffffff3b08, A=0x60360000a580, ldA=0x7fffffff8570, ipiv=0x601000005aa0, info=0x7fffffff3be0) at src/dgetrf.c:67
67              LAPACK(dgetf2)(m, n, A, ldA, ipiv, info);
(gdb) p *n
$3 = 8

Maybe I should try with a newer compiler (gcc 4.8.5 at the moment).

martin-frbg · 2019-03-21T15:22:07Z

I have a nagging feeling that this is related to the INTERFACE64=1 build - not sure if I thought/knew to make ReLAPACK compatible with this when I added it to the build some two years ago, and int/long argument mismatch might explain the astonishing growth in the value of n.
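
The value gdb printed for n is consistent with that suspicion. The sketch below splits 140733193388040 into its 32-bit halves and then simulates a callee reading 8 bytes from a location where the caller stored only a 4-byte int. It assumes a little-endian x86-64 host; the 0x7FFF filler stands in for whatever happened to sit next to the int on the stack and is an assumption for illustration, not something taken from the ReLAPACK source.

```python
import struct

# Value of args.n and *N as printed by gdb in the comments above
n_seen = 140733193388040

low = n_seen & 0xFFFFFFFF   # lower 32 bits: what a 32-bit int would hold
high = n_seen >> 32         # upper 32 bits: bytes that never belonged to n
print(hex(n_seen))
print(low, hex(high))

# Simulation: the caller writes a 4-byte int (8) followed by unrelated
# bytes (assumed 0x7FFF here); the callee reads 8 bytes as one 64-bit integer
stack_bytes = struct.pack("<iI", 8, 0x7FFF)
(n_read,) = struct.unpack("<q", stack_bytes)
print(n_read == n_seen)
```

The lower half is exactly 8, the value gdb shows for `*n` in RELAPACK_dgetrf_rec, while the upper half resembles the top of a stack address. That is what one would expect if a pointer to a 32-bit int were handed to a routine that dereferences it as a 64-bit integer.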

martin-frbg · 2019-03-21T20:51:34Z

Unfortunately I still see a segfault in your test 036 even with a quick hack for the (assumed) INTERFACE64 problem, and it is not clear to me how/where to invoke gdb in your pymolcas tool. Surely it is possible to invoke seward.exe directly with the appropriate arguments?
EDIT: never mind, it seems that calling seward.exe in the test directory created by pymolcas does it.
I still have not found out where n gets trashed, however.

Jellby · 2019-03-22T08:23:02Z

The good thing is you could reproduce it. It is possible to run seward.exe directly, but it would need setting up the appropriate files and environment first. You should be able to run a debugger with:

MOLCAS_DEBUGGER=gdb pymolcas verify 036 -d

but note that it will first run gateway.exe before going to seward.exe. It will probably be easier to create a file (say H.input) like this:

&SEWARD
  Coord = 1
    bohr
    H 0.0 0.0 0.0
  Basis = ANO-RCC-MB
  Group = NoSym

and then run:

MOLCAS_DEBUGGER=gdb WorkDir=. pymolcas H.input

(It will write some scratch files in the current directory, be warned.)

brada4 · 2019-03-22T18:39:47Z

You can try ltrace before the debugger. It usually mixes up arguments anyway, but we might get lucky.

brada4 · 2019-03-22T19:53:30Z

I don't understand the offset calculation above
/Programs/OpenBLAS/lapack/getf2/getf2_k.c:83

martin-frbg · 2019-04-14T20:34:01Z

Similar symptoms can be seen with the LAPACK tests actually, although ReLAPACK still passes its own tests. (I can only assume that back when I created the PR to merge ReLAPACK, the OpenBLAS build of lapack/TESTING was incomplete and/or I was not aware of its importance.) There are several spots in the code (notably xPBTRF) where local work arrays are allocated based on runtime parameters that may (legitimately?) become negative depending on input. Not sure yet if fixing these will solve all problems, but at least this appears to be one cause of stack corruption.

martin-frbg · 2019-04-28T22:26:27Z

Down to

grayzone:067 Failed! (seward)         
grayzone:068 Failed! (seward)       
grayzone:139 Failed! (cpf)          
grayzone:395 Failed! (rasscf)       
grayzone:405 Failed! (seward)       
grayzone:803 Failed! (seward)       
grayzone:804 Failed! (scf)          
grayzone:820 Failed! (scf)          
grayzone:834 Skipped!               
grayzone:850 Skipped!               
grayzone:851 Skipped!               
grayzone:861 Skipped!                
************************************************************************
A total of 8 test(s) failed, with 0 critical failure(s).

now

Jellby · 2019-04-29T06:58:00Z

That looks good, "grayzone" tests are not expected to pass.

martin-frbg · 2019-04-29T18:00:35Z

Great. I have now released 0.3.6 with the fixes. (The ReLAPACK build still shows some errors in the LAPACK testsuite as follows:

                        -->   LAPACK TESTING SUMMARY  <--
SUMMARY                 nb test run     numerical error         other error  
================        ===========     =================       ================  
REAL                    1205347         1913    (0.159%)        1994    (0.165%)
DOUBLE PRECISION        1213959         2883    (0.237%)        2588    (0.213%)
COMPLEX                 663996          3090    (0.465%)        3071    (0.463%)
COMPLEX16               660722          3386    (0.512%)        3098    (0.469%)

--> ALL PRECISIONS      3744024         11272   (0.301%)        10751   (0.287%)

compared to

SUMMARY                 nb test run     numerical error         other error  
================        ===========     =================       ================  
REAL                    1284869         1       (0.000%)        1       (0.000%)
DOUBLE PRECISION        1293457         0       (0.000%)        1       (0.000%)
COMPLEX                 745040          1       (0.000%)        2       (0.000%)
COMPLEX16               753628          0       (0.000%)        2       (0.000%)

--> ALL PRECISIONS      4076994         2       (0.000%)        6       (0.000%)

for a build with Reference-LAPACK from netlib, but I expect different rounding will be a factor)

brada4 · 2019-04-29T19:28:25Z

The more operations are done per point, the more (rounding) error is accumulated.
Especially if MKL and ATLAS show a similar bias...

martin-frbg closed this as completed Mar 21, 2019

martin-frbg reopened this Mar 21, 2019

martin-frbg mentioned this issue Apr 23, 2019

Fix ReLAPACK compilation with INTERFACE64 #2093

Closed

martin-frbg closed this as completed Aug 13, 2019
