segfault when inverting matrix containing nans #723
Comments
Probably the same issue as #671: with NaN elements, the function that computes the pivot point returns "some array element number" in netlib, while the machine-optimized version in OpenBLAS currently returns "some number" that can be outside the bounds of the array. As a stopgap for #671, the returned value was clamped in the caller, while the general solution needs to be in the called function. (Though using NaN values arguably takes you into the realm of "undefined behaviour", which may include crashes, flying pigs or any combination thereof.)
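To make the difference concrete, here is a minimal sketch of a pivot search written in the style of the reference BLAS IDAMAX loop, together with a caller-side clamp of the kind described above. The names `ref_style_idamax` and `clamp_pivot` are hypothetical and are not OpenBLAS symbols; the sketch handles unit stride only, and the exact form of the clamp used for #671 is an assumption.

```c
#include <math.h>
#include <stddef.h>

/* Sketch of a reference-style IDAMAX (1-based result, unit stride only).
 * ref_style_idamax is a hypothetical name, not a BLAS or OpenBLAS symbol. */
int ref_style_idamax(int n, const double *x)
{
    if (n < 1 || x == NULL)
        return 0;               /* nothing to search */

    int best = 1;
    double dmax = fabs(x[0]);
    for (int i = 1; i < n; i++) {
        /* Any comparison involving NaN is false, so a NaN entry never
         * moves `best`, and `best` always stays within 1..n. */
        if (fabs(x[i]) > dmax) {
            best = i + 1;
            dmax = fabs(x[i]);
        }
    }
    return best;
}

/* Caller-side stopgap (assumed form): force a pivot index back into 1..n. */
int clamp_pivot(int idx, int n)
{
    if (idx < 1) return 1;
    if (idx > n) return n;
    return idx;
}
```

With a loop of this shape the result may be an arbitrary element when NaNs are present, but it is always a valid element number. The clamp only hides an out-of-range result; it does not make the chosen pivot meaningful.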
Ah, thanks! I just noticed #671 myself. I'll try clamping the ipiv values from dgetrf. As for undefined behaviour, I don't think that should be the case. According to the netlib LAPACK FAQ one should be able to expect IEEE-754 behavior: "LAPACK, version 3.0, introduced new routines which rely on IEEE-754 compliance. [...] As a result, two settings were added to LAPACK/SRC/ilaenv.f to denote IEEE-754 compliance for NaN and infinity arithmetic, respectively. By default, ILAENV assumes an IEEE machine [...]"
That may be their goal, but see #642 and the associated netlib bug - probably not everything in LAPACK is NaN-safe yet.
Granted. And it's true that the pivot order is undefined if the matrix contains NaNs. I wouldn't object if dgetrf flagged an error. However, I think that dgetrf returning out-of-bounds values in the pivot array is a bug. Many of us expect OpenBLAS to work like netlib lapack/blas (but much faster -- thanks, developers!), so if "info" is clear when dgetrf returns, we don't expect a subsequent call to dgetri to destroy the heap and crash our application.
Does not crash with netlib lapack + openblas blas (pthread,omp,single) |
Something corrupts one of the pointers before invert_general_matrix (testcase.c:54) (relevant part of helgrind output)
In my testing, there's no problem with the pointers before the second call to dgetri (the first call is just a workspace query, but on the second call dgetri tries to make use of the invalid ipiv array generated by dgetrf). I'm attaching a revised version of my test program, and its output.
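A caller that wants to avoid handing a bad pivot array to dgetri could check it between the two LAPACK calls. The following is a minimal sketch only: `first_bad_pivot` is a hypothetical helper, not part of LAPACK or OpenBLAS, and it assumes a square n-by-n matrix with the usual 1-based pivot indices.

```c
#include <stddef.h>

/* Hypothetical helper: scan a pivot array as produced by dgetrf for a
 * square n-by-n matrix. Pivot indices are 1-based, so each entry is
 * expected to lie in 1..n.
 * Returns the 0-based position of the first out-of-range entry,
 * or -1 if every entry is in range. */
int first_bad_pivot(const int *ipiv, int n)
{
    if (ipiv == NULL || n < 1)
        return -1;              /* nothing to check */

    for (int i = 0; i < n; i++) {
        if (ipiv[i] < 1 || ipiv[i] > n)
            return i;           /* using this entry would index outside the matrix */
    }
    return -1;
}
```

If the helper reports a bad entry, the caller can treat the factorization as failed and skip the inversion instead of calling dgetri with that array.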
Maybe I am wrong, but it seems to fail to read the 2nd matrix (bt, disas /0x.... i.e. search for the crash address in gdb)
To spell it out, the second element of the ipiv array generated by dgetrf is 11 while the dimension of the matrix is 10 - this is the similarity to #671. A simple |
Hi, I wrote the following test code: #include <stdlib.h> extern int idamax_( int *N, double *x, int *INC_X); main()
} The results are:
Linked with refblas: Normal : 3 0.300000
Linked with OpenBLAS generic kernel: Normal : 3 0.300000
Linked with OpenBLAS optimized kernel: Normal : 3 0.300000
The optimized kernel, where idamax is written in assembly,
Best regards
@martin-frbg, you are right. I just modified
Glad to know that my comment was not completely wrong. Once the assembler code is improved, none of these guards should be strictly necessary anymore, at least three tickets can be closed for good (if I counted correctly), and maybe I can even learn a bit more about x86 assembler from comparing the old and new versions then.
On x86_64 Linux, using 0.2.16.dev, I'm getting a crash when using dgetrf/dgetri to
invert a general (square) matrix with NaN values. Obviously one should not pass
such a matrix to such a function, but this can happen inadvertently and BLAS really
shouldn't crash. The regular netlib doesn't crash. I'm attaching a test program that
demonstrates the issue. I'm also attaching the valgrind log from the crash.
valgrind.log.txt
blascrash.c.txt
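For the inadvertent case described above, a caller can also scan the matrix for NaNs before factoring it. This is a minimal sketch under stated assumptions: `matrix_has_nan` is a hypothetical helper, and it assumes column-major storage with leading dimension `lda`, as LAPACK routines expect.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical pre-check: report whether a column-major matrix with
 * `rows` rows, `cols` columns and leading dimension `lda` (lda >= rows)
 * contains any NaN. Returns 1 if a NaN is found, 0 otherwise. */
int matrix_has_nan(const double *a, size_t rows, size_t cols, size_t lda)
{
    if (a == NULL)
        return 0;

    for (size_t j = 0; j < cols; j++) {
        for (size_t i = 0; i < rows; i++) {
            /* element (i, j) lives at a[j * lda + i] in column-major order */
            if (isnan(a[j * lda + i]))
                return 1;
        }
    }
    return 0;
}
```

The scan costs one pass over the matrix, which is small next to the factorization itself, but it is a workaround in the caller and not a substitute for a fix in the library.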