segfault when inverting matrix containing nans #723

AllinCottrell · 2015-12-23T23:36:24Z

On x86_64 Linux, using 0.2.16.dev, I'm getting a crash when using dgretrf/dgetri to
invert a general (square) matrix with nan values. Obviously one should not pass
such a matrix to such a function, but this can happen inadvertently and BLAS really
shouldn't crash. The regular netlib doesn't crash. I'm attaching a test program that
demonstrates the issue. I'm also attaching the valgrind log from the crash.

valgrind.log.txt
blascrash.c.txt

martin-frbg · 2015-12-24T14:23:29Z

Probably same issue as #671 - with NaN elements, the function that computes the pivot point returns "some array element number" in netlib, while the machine-optimized version in OpenBLAS currently returns "some number" which can be outside the bounds of the array. As a stopgap solution for #671, the returned value was clamped in the caller, while the general solution needs to be in the called function. (Though using nan values arguably takes you into the realm of "undefined behaviour", which may include crashes, flying pigs or any combination thereof)

AllinCottrell · 2015-12-24T15:00:59Z

Ah, thanks! I just noticed #671 myself. I'll try clamping the ipiv values from dgetrf.

As for undefined behaviour, I don't think that should be the case. According to the netlib LAPACK FAQ one should be able to expect IEEE-754 behavior: "LAPACK, version 3.0, introduced new routines which rely on IEEE-754 compliance. [...] As a result, two settings were added to LAPACK/SRC/ilaenv.f to denote IEEE-754 compliance for NaN and infinity arithmetic, respectively. By default, ILAENV assumes an IEEE machine [...]"

martin-frbg · 2015-12-24T16:10:14Z

That may be their goal, but see #642 and associated netlib bug - probably not everything in LAPACK is NaN-safe yet.

AllinCottrell · 2015-12-24T17:31:52Z

Granted. And it's true that the pivot order is undefined if the matrix contains NaNs. I wouldn't object if dgetrf flagged an error. However, I think that dgetrf returning out-of-bounds values in the pivot array is a bug. Many of us expect OpenBLAS to work like netlib lapack/blas (but much faster -- thanks, developers!), so if "info" is clear when dgetrf returns, we don't expect a subsequent call to dgetri to destroy the heap and crash our application.

brada4 · 2015-12-29T12:06:33Z

Does not crash with netlib lapack + openblas blas (pthread,omp,single)
The crash is in dswap routine which does not care about numeric values.

brada4 · 2015-12-29T18:58:28Z

Something corrupts one of pointers before invert_general_matrix (testcase.c:54)
Can you print all pointers before and after each dgetr?_ call ?

(relevant part of helgrind output)
==78412== Process terminating with default action of signal 11 (SIGSEGV)
==78412== Access not within mapped region at address 0x2805B286E0
==78412== at 0x606B5ED: dswap_k_SANDYBRIDGE (in /usr/lib64/libopenblas_pthreads.so.0)
==78412== by 0x5182EA6: legacy_exec (in /usr/lib64/libopenblas_pthreads.so.0)
==78412== by 0x518364F: exec_blas (in /usr/lib64/libopenblas_pthreads.so.0)
==78412== by 0x5183BD2: blas_level1_thread (in /usr/lib64/libopenblas_pthreads.so.0)
==78412== by 0x4F6FAA1: dswap_ (in /usr/lib64/libopenblas_pthreads.so.0)
==78412== by 0x690FB82: dgetri_ (in /usr/lib64/libopenblas_pthreads.so.0)
==78412== by 0x400B0A: invert_general_matrix (testcase.c:54)
==78412== by 0x400C3C: main (testcase.c:83)
==78412== If you believe this happened as a result of a stack
==78412== overflow in your program's main thread (unlikely but
==78412== possible), you can try to increase the size of the
==78412== main thread stack using the --main-stacksize= flag.
==78412== The main thread stack size used in this run was 8388608.

AllinCottrell · 2015-12-30T00:49:02Z

In my testing, there's no problem with the pointers before the second call to dgetri (the first call is just a workspace query, but on the second call dgetri tries to make use of the invalid ipiv array generated by dgetrf). I'm attaching a revised version of my test program, and its output.
blascrash2.c.txt
(The C file is given a ".txt" suffix because GitHub won't accept *.c files!)
output.txt

brada4 · 2015-12-30T09:34:54Z

Maybe i am wrong, but seems to fail to read 2nd matrix (bt, disas /0x.... i.e search crash address in gdb)
0x00007ffff69363c8 <+456>: test $0x2,%rdi
0x00007ffff69363cf <+463>: jle 0x7ffff69363f0 <dswap_k_ATOM+496>
0x00007ffff69363d1 <+465>: movaps -0x80(%rcx),%xmm0
=> 0x00007ffff69363d5 <+469>: movaps -0x80(%r9),%xmm1
0x00007ffff69363da <+474>: movaps %xmm0,-0x80(%r9)
0x00007ffff69363df <+479>: movaps %xmm1,-0x80(%rcx)
0x00007ffff69363e3 <+483>: add $0x10,%rcx

martin-frbg · 2015-12-30T11:56:57Z

To spell it out, the second element of the ipiv array generated by dgetrf is 11 while the dimension of the matrix is 10 - this is the similarity to #671. A simple
for (int ii=0;ii<dim;ii++) if (ipiv[ii]<0 || ipiv[ii]>dim) ipiv[ii]=dim;
inserted between the calls to dgetrf and dgetri in the testcase makes the crash go away. But this is just papering over part of the bug, where ipiv[2] is outside the bounds already when used within dgetf2_k - the actual cause appears to be (like idamax in #671) the out-of-bounds result of IAMAX_K as called in line 96 of lapack/getf2/getf2_k.c.
If I put an if (jp>n) jp=n; there before the assignment to ipiv[j+offset], all the valgrind warnings (from accesses beyond the end of the array in the next iteration of the outer loop) go away.
I suspect 5a29160 did not address this particular code path, as all it does is clamp MAX_K as
obtained in interface/imax.c, which does not appear to figure in the sequence of events here.

wernsaar · 2016-01-26T12:11:37Z

Hi,

I wrote the following test code:

#include <stdlib.h>
#include <stdio.h>

extern int idamax_( int *N, double *x, int *INC_X);

main()
{

int n=6;
int inc_x =1;
double x[6] = { 0.1, 0.2, 0.3, 0.2, 0.1, 0.0 };
int i;
int j;
i = idamax_(&n, x, &inc_x);
printf("Normal                   :\t\t%d\t%f\n",i,x[i-1]);

x[3]=1.0/0.0;   
i = idamax_(&n, x, &inc_x);
printf("With INF at position 4   :\t\t%d\t%f\n",i,x[i-1]);

x[3]=0.0/0.0;   
i = idamax_(&n, x, &inc_x);
printf("With NAN at position 4   :\t\t%d\t%f\n",i,x[i-1]);

x[3]=0.2;
x[0]=0.0/0.0;   
i = idamax_(&n, x, &inc_x);
printf("With NAN at position 1   :\t\t%d\t%f\n",i,x[i-1]);

x[0]=0.1;
x[1]=0.0/0.0;   
i = idamax_(&n, x, &inc_x);
printf("With NAN at position 2   :\t\t%d\t%f\n",i,x[i-1]);

for ( j=0; j<n; j++)
    x[j]=0.0/0.0;   

i = idamax_(&n, x, &inc_x);
printf("With NAN at all positions:\t\t%d\t%f\n",i,x[i-1]);

}

The results are:

Linked with refblas:

Normal : 3 0.300000
With INF at position 4 : 4 inf
With NAN at position 4 : 3 0.300000
With NAN at position 1 : 1 -nan
With NAN at position 2 : 3 0.300000
With NAN at all positions: 1 -nan

Linked with OpenBLAS generic kernel:

Normal : 3 0.300000
With INF at position 4 : 4 inf
With NAN at position 4 : 3 0.300000
With NAN at position 1 : 1 -nan
With NAN at position 2 : 3 0.300000
With NAN at all positions: 1 -nan

Linked with OpenBLAS optimized kernel:

Normal : 3 0.300000
With INF at position 4 : 4 inf
With NAN at position 4 : 3 0.300000
With NAN at position 1 : 1 -nan
With NAN at position 2 : 2 -nan
With NAN at all positions: 1 -nan

The optimized kernel, where idamax is written in assembly,
computes a wrong result, if a nan appears before the max value.

Best regards
Werner

xianyi · 2016-01-26T14:58:15Z

@martin-frbg , you are right. I just modified interface/imax.c and forgot to modify other function which directly use i?amax kernels.

martin-frbg · 2016-01-26T15:19:42Z

Glad to know that my comment was not completely wrong. Once the assembler code is improved, none of these guards should be strictly necessary anymore, at least three tickets can be closed for good (if I counted correctly), and maybe I can even learn a bit more about x86 assembler from comparing the old and new versions then.

martin-frbg mentioned this issue Dec 30, 2015

Incorrect Idamax with NaN values #624

Closed

xianyi added a commit that referenced this issue Jan 26, 2016

Refs #723. Avoid out of boundary for getf2.

13f0f8c

xianyi closed this as completed Mar 1, 2016

martin-frbg mentioned this issue Mar 21, 2019

Crash in ReLAPACK #2066

Closed

martin-frbg mentioned this issue Jul 31, 2023

note on getf2 (unblocked LU factorization) #4174

Closed

martin-frbg added a commit to martin-frbg/OpenBLAS that referenced this issue Aug 25, 2023

clarify the comment on the out-of-bounds check from OpenMathLib#723

22a402b

martin-frbg added a commit to martin-frbg/OpenBLAS that referenced this issue Aug 26, 2023

clarify the comment on the out-of-bounds check from OpenMathLib#723

3b6050a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

segfault when inverting matrix containing nans #723

segfault when inverting matrix containing nans #723

AllinCottrell commented Dec 23, 2015

martin-frbg commented Dec 24, 2015

AllinCottrell commented Dec 24, 2015

martin-frbg commented Dec 24, 2015

AllinCottrell commented Dec 24, 2015

brada4 commented Dec 29, 2015

brada4 commented Dec 29, 2015

AllinCottrell commented Dec 30, 2015

brada4 commented Dec 30, 2015

martin-frbg commented Dec 30, 2015

wernsaar commented Jan 26, 2016

xianyi commented Jan 26, 2016

martin-frbg commented Jan 26, 2016

segfault when inverting matrix containing nans #723

segfault when inverting matrix containing nans #723

Comments

AllinCottrell commented Dec 23, 2015

martin-frbg commented Dec 24, 2015

AllinCottrell commented Dec 24, 2015

martin-frbg commented Dec 24, 2015

AllinCottrell commented Dec 24, 2015

brada4 commented Dec 29, 2015

brada4 commented Dec 29, 2015

AllinCottrell commented Dec 30, 2015

brada4 commented Dec 30, 2015

martin-frbg commented Dec 30, 2015

wernsaar commented Jan 26, 2016

xianyi commented Jan 26, 2016

martin-frbg commented Jan 26, 2016