Crash in ReLAPACK #2066

Closed
Jellby opened this issue Mar 21, 2019 · 18 comments

@Jellby
Contributor

Jellby commented Mar 21, 2019

I tried to use OpenBLAS with ReLAPACK in OpenMolcas (https://gitlab.com/Molcas/OpenMolcas) and got a crash, apparently in RELAPACK_dgetrf_rec at dgetrf.c.

I compiled OpenBLAS with:

make USE_OPENMP=1 INTERFACE64=1 NO_CBLAS=1 BUILD_RELAPACK=1 LIBPREFIX=libopenblas_i8

and the crash occurs, for example, when I run:

pymolcas verify 036

Without ReLAPACK, I don't see the problem.

What other information can I provide or how can I debug it further?

@martin-frbg
Collaborator

You could try building OpenBLAS with debug information (setting DEBUG=1, or adding -g to the compile flags) to get more detailed line information. (I assume your "crash" is a segmentation fault, or is it just that the calculation goes haywire?) Which version of OpenBLAS did you try? (Not much has happened within ReLAPACK recently, but the problem could have started elsewhere in the code.)

@Jellby
Contributor Author

Jellby commented Mar 21, 2019

This is with v0.3.5. A crash is indeed a segmentation fault. With DEBUG=1 I get this stack trace:

symbolized stack trace:
    #0 (?) /Programs/OpenBLAS/libopenblas_i8_nehalemp-r0.3.5.so 0x3ab4d9
    #1 (?) /Programs/OpenBLAS/libopenblas_i8_nehalemp-r0.3.5.so 0x8ca98
    #2 RELAPACK_dgetrf_rec at /Programs/OpenBLAS/relapack/src/dgetrf.c:67
    #3 RELAPACK_dgetrf_rec at /Programs/OpenBLAS/relapack/src/dgetrf.c:98 (discriminator 3)
    #4 RELAPACK_dgetrf_rec at /Programs/OpenBLAS/relapack/src/dgetrf.c:98 (discriminator 3)
    #5 RELAPACK_dgetrf_rec at /Programs/OpenBLAS/relapack/src/dgetrf.c:98 (discriminator 3)
    #6 RELAPACK_dgetrf_rec at /Programs/OpenBLAS/relapack/src/dgetrf.c:98 (discriminator 3)
    #7 RELAPACK_dgetrf at /Programs/OpenBLAS/relapack/src/dgetrf.c:35
    #8 dgetrf_ at /Programs/OpenBLAS/relapack/src/lapack_wrappers.c:373
    #9 xdr_dmatinv_ at /Programs/OpenMolcas/src/dkh_util/xdr_dmatinv.f:28
    #10 xdr_fpfw_ at /Programs/OpenMolcas/src/dkh_util/xdr_fpfw.f:97
    #11 dkh_ts1e_ at /Programs/OpenMolcas/src/dkh_util/dkh_ts1e.f:45
    #12 xdr_ham_ at /Programs/OpenMolcas/src/dkh_util/xdr_ham.f:131
    #13 dkrelint_dp_ at /Programs/OpenMolcas/src/dkh_util/dkrelint_dp.f:366
    #14 drv1el_ at /Programs/OpenMolcas/src/seward/drv1el.f:1894
    #15 seward_ at /Programs/OpenMolcas/src/seward/seward.f:317
    #16 MAIN__ at /Programs/OpenMolcas/src/seward/main.f:23
    #17 __libc_start_main at /build/eglibc-ripdx6/eglibc-2.19/csu/libc-start.c:287
    #18 (?) /Programs/OpenMolcas/bounds/bin/seward.exe 0x405f51

ETA: with CBLAS, the first two lines are:

    #0 dgetf2_k at /Programs/OpenBLAS/lapack/getf2/getf2_k.c:83
    #1 dgetf2_ at /Programs/OpenBLAS/interface/lapack/getf2.c:96

@martin-frbg
Collaborator

Not sure what to make of this - dgetrf.c line 67 is where it forwards the call to stock LAPACK dgetf2 on the assumption that the problem size is too small to make the recursive block approach worthwhile. I believe dgetf2 would complain if n actually managed to become zero or negative, but this may need confirmation.
Thanks for the update - from the git history it appears getf2_k already received a patch for some out-of-bounds mischief in #723.
From the backtrace it looks like you are running this on Windows? (Not that it matters except for ease of debugging.)

@Jellby
Contributor Author

Jellby commented Mar 21, 2019

Nope, not Windows, it's Ubuntu 14.04.

@martin-frbg
Collaborator

At first glance the kludge from #723 should keep it from doing any accesses beyond the end of the array at line 83, as long as the jp value from line 97 remains positive. If it did not, it should have crashed at line 100, which has basically the same assignment. Could you try to obtain the values in ipiv at the time of failure by running your program from gdb?

@Jellby
Contributor Author

Jellby commented Mar 21, 2019

I don't know if I'm doing it right, but with gdb I get:

Program received signal SIGSEGV, Segmentation fault.
0x00007fffe1d2aa6d in dgetf2_k (args=0x7fffffff38f0, range_m=0x0, range_n=0x0, sa=0x7fffdca08020, sb=0x7fffdcb04020, myid=0) at getf2_k.c:82
82              temp1 = *(b + i);
(gdb) p args[0]   
$1 = {a = 0x60360000fd80, b = 0x0, c = 0x601000004fa0, d = 0x7fffe4433898, alpha = 0x10007fff6774, beta = 0x7fffffff3a70, m = 8, n = 140733193388040, k = 140737488323920, lda = 8, ldb = 105785044565376, 
  ldc = 140737488304872, ldd = 140737488323920, common = 0x601000004fa0, nthreads = 140737488305088}
(gdb) x/16 ipiv
0x601000004fa0: 0x00000008      0x00000000      0x00000007      0x00000000
0x601000004fb0: 0x00000006      0x00000000      0x00000005      0x00000000
0x601000004fc0: 0x00000005      0x00000000      0x00000006      0x00000000
0x601000004fd0: 0x00000007      0x00000000      0x00000008      0x00000000

I guess ipiv looks right, but n doesn't.

@martin-frbg
Collaborator

Yes, n looks quite unhealthy. Wonder where this happened, perhaps this is still visible when you go up the call tree in gdb and print the args at each level...

@Jellby
Contributor Author

Jellby commented Mar 21, 2019

Does this help?

Program received signal SIGSEGV, Segmentation fault.
0x00007fffe1d2aa6d in dgetf2_k (args=0x7fffffff3910, range_m=0x0, range_n=0x0, sa=0x7fff9d596020, sb=0x7fff9d692020, myid=0) at getf2_k.c:82
82              temp1 = *(b + i);
(gdb) p args.n
$1 = 140733193388040
(gdb) up
#1  0x00007fffe18710cd in dgetf2_ (M=0x7fffffff8570, N=0x7fffffff3b08, a=0x60360000a580, ldA=0x7fffffff8570, ipiv=0x601000005aa0, Info=0x7fffffff3be0) at lapack/getf2.c:96
96        info = GETF2(&args, NULL, NULL, sa, sb, 0);
(gdb) p *N
$2 = 140733193388040
(gdb) up
#2  0x00007fffe1870482 in RELAPACK_dgetrf_rec (m=0x7fffffff8570, n=0x7fffffff3b08, A=0x60360000a580, ldA=0x7fffffff8570, ipiv=0x601000005aa0, info=0x7fffffff3be0) at src/dgetrf.c:67
67              LAPACK(dgetf2)(m, n, A, ldA, ipiv, info);
(gdb) p *n
$3 = 8

Maybe I should try with a newer compiler (gcc 4.8.5 at the moment).

@martin-frbg
Collaborator

martin-frbg commented Mar 21, 2019

I have a nagging feeling that this is related to the INTERFACE64=1 build - not sure if I thought/knew to make ReLAPACK compatible with this when I added it to the build some two years ago, and int/long argument mismatch might explain the astonishing growth in the value of n.

@martin-frbg
Collaborator

martin-frbg commented Mar 21, 2019

Unfortunately I still see a segfault in your test 036 even with a quick hack for the (assumed) INTERFACE64 problem, and it is not clear to me how/where to invoke gdb in your pymolcas tool. Surely it is possible to invoke seward.exe directly with the appropriate arguments?
EDIT: never mind, it seems calling seward.exe in the test directory created by pymolcas does it.
Still have not found out where n gets trashed, however.

@Jellby
Contributor Author

Jellby commented Mar 22, 2019

The good thing is you could reproduce it. It is possible to run seward.exe directly, but it would need setting up the appropriate files and environment first. You should be able to run a debugger with:

MOLCAS_DEBUGGER=gdb pymolcas verify 036 -d

but note that it will first run gateway.exe before going to seward.exe. It will probably be easier to create a file (say H.input) like this:

&SEWARD
  Coord = 1
    bohr
    H 0.0 0.0 0.0
  Basis = ANO-RCC-MB
  Group = NoSym

and then run:

MOLCAS_DEBUGGER=gdb WorkDir=. pymolcas H.input

(It will write some scratch files in the current directory, be warned.)

@brada4
Contributor

brada4 commented Mar 22, 2019

You can try ltrace before debugger, it usually mixes up arguments anyway, but we might get lucky.

@brada4
Contributor

brada4 commented Mar 22, 2019

I don't understand the offset calculation above at
/Programs/OpenBLAS/lapack/getf2/getf2_k.c:83

@martin-frbg
Collaborator

Similar symptoms can be seen with the LAPACK tests actually, although ReLAPACK still passes its own tests. (I can only assume that back when I created the PR to merge ReLAPACK, the OpenBLAS build of lapack/TESTING was incomplete and/or I was not aware of its importance.) There are several spots in the code (notably xPBTRF) where local work arrays are allocated based on runtime parameters that may (legitimately?) become negative depending on input. Not sure yet if fixing these will solve all problems, but at least this appears to be one cause of stack corruption.

@martin-frbg
Collaborator

Down to

grayzone:067 Failed! (seward)         
grayzone:068 Failed! (seward)       
grayzone:139 Failed! (cpf)          
grayzone:395 Failed! (rasscf)       
grayzone:405 Failed! (seward)       
grayzone:803 Failed! (seward)       
grayzone:804 Failed! (scf)          
grayzone:820 Failed! (scf)          
grayzone:834 Skipped!               
grayzone:850 Skipped!               
grayzone:851 Skipped!               
grayzone:861 Skipped!                
************************************************************************
A total of 8 test(s) failed, with 0 critical failure(s).

now

@Jellby
Contributor Author

Jellby commented Apr 29, 2019

That looks good, "grayzone" tests are not expected to pass.

@martin-frbg
Collaborator

Great. I have now released 0.3.6 with the fixes. (The ReLAPACK build still shows some errors in the LAPACK testsuite as follows:

                        -->   LAPACK TESTING SUMMARY  <--
SUMMARY                 nb test run     numerical error         other error  
================        ===========     =================       ================  
REAL                    1205347         1913    (0.159%)        1994    (0.165%)
DOUBLE PRECISION        1213959         2883    (0.237%)        2588    (0.213%)
COMPLEX                 663996          3090    (0.465%)        3071    (0.463%)
COMPLEX16               660722          3386    (0.512%)        3098    (0.469%)

--> ALL PRECISIONS      3744024         11272   (0.301%)        10751   (0.287%)

compared to

SUMMARY                 nb test run     numerical error         other error  
================        ===========     =================       ================  
REAL                    1284869         1       (0.000%)        1       (0.000%)
DOUBLE PRECISION        1293457         0       (0.000%)        1       (0.000%)
COMPLEX                 745040          1       (0.000%)        2       (0.000%)
COMPLEX16               753628          0       (0.000%)        2       (0.000%)

--> ALL PRECISIONS      4076994         2       (0.000%)        6       (0.000%)

for a build with Reference-LAPACK from netlib, but I expect different rounding will be a factor)

@brada4
Contributor

brada4 commented Apr 29, 2019

The more operations are done per point the more (rounding) error is accumulated.
Esp if MKL and ATLAS land similar bias....
