Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

segfault when inverting matrix containing nans #723

Closed
AllinCottrell opened this issue Dec 23, 2015 · 12 comments
Closed

segfault when inverting matrix containing nans #723

AllinCottrell opened this issue Dec 23, 2015 · 12 comments

Comments

@AllinCottrell
Copy link

On x86_64 Linux, using 0.2.16.dev, I'm getting a crash when using dgretrf/dgetri to
invert a general (square) matrix with nan values. Obviously one should not pass
such a matrix to such a function, but this can happen inadvertently and BLAS really
shouldn't crash. The regular netlib doesn't crash. I'm attaching a test program that
demonstrates the issue. I'm also attaching the valgrind log from the crash.

valgrind.log.txt
blascrash.c.txt

@martin-frbg
Copy link
Collaborator

Probably same issue as #671 - with NaN elements, the function that computes the pivot point returns "some array element number" in netlib, while the machine-optimized version in OpenBLAS currently returns "some number" which can be outside the bounds of the array. As a stopgap solution for #671, the returned value was clamped in the caller, while the general solution needs to be in the called function. (Though using nan values arguably takes you into the realm of "undefined behaviour", which may include crashes, flying pigs or any combination thereof)

@AllinCottrell
Copy link
Author

Ah, thanks! I just noticed #671 myself. I'll try clamping the ipiv values from dgetrf.

As for undefined behaviour, I don't think that should be the case. According to the netlib LAPACK FAQ one should be able to expect IEEE-754 behavior: "LAPACK, version 3.0, introduced new routines which rely on IEEE-754 compliance. [...] As a result, two settings were added to LAPACK/SRC/ilaenv.f to denote IEEE-754 compliance for NaN and infinity arithmetic, respectively. By default, ILAENV assumes an IEEE machine [...]"

@martin-frbg
Copy link
Collaborator

That may be their goal, but see #642 and associated netlib bug - probably not everything in LAPACK is NaN-safe yet.

@AllinCottrell
Copy link
Author

Granted. And it's true that the pivot order is undefined if the matrix contains NaNs. I wouldn't object if dgetrf flagged an error. However, I think that dgetrf returning out-of-bounds values in the pivot array is a bug. Many of us expect OpenBLAS to work like netlib lapack/blas (but much faster -- thanks, developers!), so if "info" is clear when dgetrf returns, we don't expect a subsequent call to dgetri to destroy the heap and crash our application.

@brada4
Copy link
Contributor

brada4 commented Dec 29, 2015

Does not crash with netlib lapack + openblas blas (pthread,omp,single)
The crash is in dswap routine which does not care about numeric values.

@brada4
Copy link
Contributor

brada4 commented Dec 29, 2015

Something corrupts one of pointers before invert_general_matrix (testcase.c:54)
Can you print all pointers before and after each dgetr?_ call ?

(relevant part of helgrind output)
==78412== Process terminating with default action of signal 11 (SIGSEGV)
==78412== Access not within mapped region at address 0x2805B286E0
==78412== at 0x606B5ED: dswap_k_SANDYBRIDGE (in /usr/lib64/libopenblas_pthreads.so.0)
==78412== by 0x5182EA6: legacy_exec (in /usr/lib64/libopenblas_pthreads.so.0)
==78412== by 0x518364F: exec_blas (in /usr/lib64/libopenblas_pthreads.so.0)
==78412== by 0x5183BD2: blas_level1_thread (in /usr/lib64/libopenblas_pthreads.so.0)
==78412== by 0x4F6FAA1: dswap_ (in /usr/lib64/libopenblas_pthreads.so.0)
==78412== by 0x690FB82: dgetri_ (in /usr/lib64/libopenblas_pthreads.so.0)
==78412== by 0x400B0A: invert_general_matrix (testcase.c:54)
==78412== by 0x400C3C: main (testcase.c:83)
==78412== If you believe this happened as a result of a stack
==78412== overflow in your program's main thread (unlikely but
==78412== possible), you can try to increase the size of the
==78412== main thread stack using the --main-stacksize= flag.
==78412== The main thread stack size used in this run was 8388608.

@AllinCottrell
Copy link
Author

In my testing, there's no problem with the pointers before the second call to dgetri (the first call is just a workspace query, but on the second call dgetri tries to make use of the invalid ipiv array generated by dgetrf). I'm attaching a revised version of my test program, and its output.
blascrash2.c.txt
(The C file is given a ".txt" suffix because GitHub won't accept *.c files!)
output.txt

@brada4
Copy link
Contributor

brada4 commented Dec 30, 2015

Maybe i am wrong, but seems to fail to read 2nd matrix (bt, disas /0x.... i.e search crash address in gdb)
0x00007ffff69363c8 <+456>: test $0x2,%rdi
0x00007ffff69363cf <+463>: jle 0x7ffff69363f0 <dswap_k_ATOM+496>
0x00007ffff69363d1 <+465>: movaps -0x80(%rcx),%xmm0
=> 0x00007ffff69363d5 <+469>: movaps -0x80(%r9),%xmm1
0x00007ffff69363da <+474>: movaps %xmm0,-0x80(%r9)
0x00007ffff69363df <+479>: movaps %xmm1,-0x80(%rcx)
0x00007ffff69363e3 <+483>: add $0x10,%rcx

@martin-frbg
Copy link
Collaborator

To spell it out, the second element of the ipiv array generated by dgetrf is 11 while the dimension of the matrix is 10 - this is the similarity to #671. A simple
for (int ii=0;ii<dim;ii++) if (ipiv[ii]<0 || ipiv[ii]>dim) ipiv[ii]=dim;
inserted between the calls to dgetrf and dgetri in the testcase makes the crash go away. But this is just papering over part of the bug, where ipiv[2] is outside the bounds already when used within dgetf2_k - the actual cause appears to be (like idamax in #671) the out-of-bounds result of IAMAX_K as called in line 96 of lapack/getf2/getf2_k.c.
If I put an if (jp>n) jp=n; there before the assignment to ipiv[j+offset], all the valgrind warnings (from accesses beyond the end of the array in the next iteration of the outer loop) go away.
I suspect 5a29160 did not address this particular code path, as all it does is clamp MAX_K as
obtained in interface/imax.c, which does not appear to figure in the sequence of events here.

@wernsaar
Copy link
Contributor

Hi,

I wrote the following test code:

#include <stdlib.h>
#include <stdio.h>

extern int idamax_( int *N, double *x, int *INC_X);

main()
{

int n=6;
int inc_x =1;
double x[6] = { 0.1, 0.2, 0.3, 0.2, 0.1, 0.0 };
int i;
int j;
i = idamax_(&n, x, &inc_x);
printf("Normal                   :\t\t%d\t%f\n",i,x[i-1]);

x[3]=1.0/0.0;   
i = idamax_(&n, x, &inc_x);
printf("With INF at position 4   :\t\t%d\t%f\n",i,x[i-1]);

x[3]=0.0/0.0;   
i = idamax_(&n, x, &inc_x);
printf("With NAN at position 4   :\t\t%d\t%f\n",i,x[i-1]);

x[3]=0.2;
x[0]=0.0/0.0;   
i = idamax_(&n, x, &inc_x);
printf("With NAN at position 1   :\t\t%d\t%f\n",i,x[i-1]);

x[0]=0.1;
x[1]=0.0/0.0;   
i = idamax_(&n, x, &inc_x);
printf("With NAN at position 2   :\t\t%d\t%f\n",i,x[i-1]);

for ( j=0; j<n; j++)
    x[j]=0.0/0.0;   

i = idamax_(&n, x, &inc_x);
printf("With NAN at all positions:\t\t%d\t%f\n",i,x[i-1]);

}

The results are:

Linked with refblas:

Normal : 3 0.300000
With INF at position 4 : 4 inf
With NAN at position 4 : 3 0.300000
With NAN at position 1 : 1 -nan
With NAN at position 2 : 3 0.300000
With NAN at all positions: 1 -nan

Linked with OpenBLAS generic kernel:

Normal : 3 0.300000
With INF at position 4 : 4 inf
With NAN at position 4 : 3 0.300000
With NAN at position 1 : 1 -nan
With NAN at position 2 : 3 0.300000
With NAN at all positions: 1 -nan

Linked with OpenBLAS optimized kernel:

Normal : 3 0.300000
With INF at position 4 : 4 inf
With NAN at position 4 : 3 0.300000
With NAN at position 1 : 1 -nan
With NAN at position 2 : 2 -nan
With NAN at all positions: 1 -nan

The optimized kernel, where idamax is written in assembly,
computes a wrong result, if a nan appears before the max value.

Best regards
Werner

@xianyi
Copy link
Collaborator

xianyi commented Jan 26, 2016

@martin-frbg , you are right. I just modified interface/imax.c and forgot to modify other function which directly use i?amax kernels.

@martin-frbg
Copy link
Collaborator

Glad to know that my comment was not completely wrong. Once the assembler code is improved, none of these guards should be strictly necessary anymore, at least three tickets can be closed for good (if I counted correctly), and maybe I can even learn a bit more about x86 assembler from comparing the old and new versions then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants