Memory error in dswap_k/dgetf2_k #671

kronbichler · 2015-10-22T11:30:52Z

For a (admittedly corner case) simple 8x8 matrix inversion problem according to the code:

#include <limits>

extern "C"
{
  void dgetrf_ (const int *m, const int *n, double *A,
                const int *lda, int *ipiv, int *info);
  void dgetri_ (const int *n, double *A, const int *lda,
                int *ipiv, double *inv_work, const int *lwork, int *info);
}

int main()
{
  const int N = 8;
  const int lwork = 2*N;
  double *mat = new double[N*N];
  int *ipiv = new int[N];
  double *work = new double[lwork];
  int info = 0;
  for (int i=0; i<N; ++i)
    for (int j=0; j<N; ++j)
      mat[i*N+j] = -std::numeric_limits<double>::quiet_NaN();

  dgetrf_ (&N, &N, mat, &N, ipiv, &info);
  dgetri_ (&N, mat, &N, ipiv, work, &lwork, &info);

  return info;
}

I get memory access errors in both the factorization phase and the inversion phase:

mklap4:openblas_bug$ g++ -L/home/kronbichler/sw/lib/ -lopenblas -Wl,-rpath=/home/kronbichler/sw/lib/ -lopenblas test.cc 
mklap4:openblas_bug$ valgrind ./a.out 
==8649== Memcheck, a memory error detector
==8649== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==8649== Using Valgrind-3.11.0.SVN and LibVEX; rerun with -h for copyright info
==8649== Command: ./a.out
==8649== 
==8649== Invalid read of size 8
==8649==    at 0x4D0DF3A: dgetf2_k (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4D0CD75: dgetrf_single (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4AAEA64: dgetrf_ (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4008D2: main (in /home/kronbichler/Work/deal_tests/trilinos_tests/openblas_bug/a.out)
==8649==  Address 0x67a61e0 is 0 bytes after a block of size 512 alloc'd
==8649==    at 0x402D81C: operator new[](unsigned long) (vg_replace_malloc.c:422)
==8649==    by 0x400825: main (in /home/kronbichler/Work/deal_tests/trilinos_tests/openblas_bug/a.out)
==8649== 
==8649== Invalid write of size 8
==8649==    at 0x4D0DF44: dgetf2_k (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4D0CD75: dgetrf_single (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4AAEA64: dgetrf_ (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4008D2: main (in /home/kronbichler/Work/deal_tests/trilinos_tests/openblas_bug/a.out)
==8649==  Address 0x67a61e0 is 0 bytes after a block of size 512 alloc'd
==8649==    at 0x402D81C: operator new[](unsigned long) (vg_replace_malloc.c:422)
==8649==    by 0x400825: main (in /home/kronbichler/Work/deal_tests/trilinos_tests/openblas_bug/a.out)
==8649== 
==8649== Invalid read of size 16
==8649==    at 0x4BEF93D: dswap_k (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4AA4B86: dswap_ (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4E56F20: dgetri_ (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4008FB: main (in /home/kronbichler/Work/deal_tests/trilinos_tests/openblas_bug/a.out)
==8649==  Address 0x67a61e0 is 0 bytes after a block of size 512 alloc'd
==8649==    at 0x402D81C: operator new[](unsigned long) (vg_replace_malloc.c:422)
==8649==    by 0x400825: main (in /home/kronbichler/Work/deal_tests/trilinos_tests/openblas_bug/a.out)
==8649== 
==8649== Invalid write of size 8
==8649==    at 0x4BEF942: dswap_k (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4AA4B86: dswap_ (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4E56F20: dgetri_ (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4008FB: main (in /home/kronbichler/Work/deal_tests/trilinos_tests/openblas_bug/a.out)
==8649==  Address 0x67a61e0 is 0 bytes after a block of size 512 alloc'd
==8649==    at 0x402D81C: operator new[](unsigned long) (vg_replace_malloc.c:422)
==8649==    by 0x400825: main (in /home/kronbichler/Work/deal_tests/trilinos_tests/openblas_bug/a.out)
==8649== 
==8649== Invalid read of size 16
==8649==    at 0x4BEF94F: dswap_k (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4AA4B86: dswap_ (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4E56F20: dgetri_ (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4008FB: main (in /home/kronbichler/Work/deal_tests/trilinos_tests/openblas_bug/a.out)
==8649==  Address 0x67a61f0 is 16 bytes after a block of size 512 alloc'd
==8649==    at 0x402D81C: operator new[](unsigned long) (vg_replace_malloc.c:422)
==8649==    by 0x400825: main (in /home/kronbichler/Work/deal_tests/trilinos_tests/openblas_bug/a.out)
==8649== 
==8649== Invalid write of size 8
==8649==    at 0x4BEF954: dswap_k (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4AA4B86: dswap_ (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4E56F20: dgetri_ (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4008FB: main (in /home/kronbichler/Work/deal_tests/trilinos_tests/openblas_bug/a.out)
==8649==  Address 0x67a61f0 is 16 bytes after a block of size 512 alloc'd
==8649==    at 0x402D81C: operator new[](unsigned long) (vg_replace_malloc.c:422)
==8649==    by 0x400825: main (in /home/kronbichler/Work/deal_tests/trilinos_tests/openblas_bug/a.out)
==8649== 

valgrind: m_mallocfree.c:303 (get_bszB_as_is): Assertion 'bszB_lo == bszB_hi' failed.
valgrind: Heap block lo/hi size mismatch: lo = 576, hi = 18444492273895866368.
This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata.  If you fix any
invalid writes reported by Memcheck, this assertion failure will
probably go away.  Please try that before reporting this as a bug.

The error seems to come from the dswap routines that do partial pivoting. The matrix does only contain NaN and inversion makes no sense, but OpenBLAS should not create memory access errors.

I compiled openBLAS from the latest git source but also checked release 0.2.14. Appears on both haswell compilation (see above) and penryn compilation. Compilers: gcc/gfortran 5.2, no other special options in openBLAS build process.

The text was updated successfully, but these errors were encountered:

martin-frbg · 2015-10-22T12:35:58Z

Looks like some kind of calling convention problem between C and FORTRAN, does the code snippet in https://icl.cs.utk.edu/lapack-forum/viewtopic.php?f=2&t=23 work for you ?

kronbichler · 2015-10-22T13:00:40Z

I don't think it's a calling convention problem because in that case the code should fail in any case. If I add sensible matrix data, e.g.,

      mat[i*N+j] = 1. + 0.2*rand()/RAND_MAX;

instead of std::numeric_limits<double>::quiet_NaN(), I do not see any memory errors. It really is due to the degenerate matrix in my opinion. Ah, one detail I forgot: The error is observed on x86_64 linux.

kronbichler · 2015-10-22T13:16:50Z

Looking closer, the following pivots are computed by dgetrf:

9 2 3 4 5 6 7 8

The first index 9 is out of bound for a matrix of size 8. (So the downstream problems valgrind detects in dgetri might be due to that.) But there must be a bug in the OpenBLAS implementation of dgetrf_ to make it return this number.

martin-frbg · 2015-10-22T22:42:19Z

Alright, it is something about the handling of NaN values in input (or even a matrix composed exclusively of them) - the same code works fine even if I use infinity() for all the values. (And just for grins - I can make it work with NaNs by setting only mat[7] to a non-NaN value. Does not work with just any matrix element, perhaps related to how the task is divided internally ?)
Not sure if this is a bug or just a curiousity in "undefined behaviour" territory... wonder how the generic netlib dgetrf behaves ?

xianyi · 2015-10-23T00:27:58Z

@kronbichler , because of the NaNs, the idamax (used by dgetf2 to select the pivot) may return the wrong overflowed index.

Please look at #624

kronbichler · 2015-10-23T07:04:26Z

@xianyi Thanks for info. This could very well be the cause. Unfortunately, it looks like the implementation is done kernel/x86_64/iamax_sse2.S and I don't speak assembler so I can't help :-(

@martin-frbg If I call the netlib lapack (linked with -llapack -lblas using the system libraries), I get the following ipiv array:

1 2 3 4 5 6 7 8

Similarly, MKL gives me

8 8 8 4 8 8 8 8

Regarding undefined behavior: I don't know the specifics of the LAPACK interface with NaN numbers, but I don't like to work around a memory bug in our code that's two layers away from the LAPACK/BLAS implementation (I call the functions in a big C++ project that encapsulates the Trilinos EPETRA_LAPACK methods which in turn provide access to OpenBLAS). On the other hand, I really like the OpenBLAS project since it provides very good performance, much better than all other open-source alternatives we've tried.

martin-frbg · 2015-10-23T07:25:50Z

I agree that it is a bug, I am arguing that it "only" has this drastic consequences when the input is dubious already. Perhaps a kludgy temporary solution could be to clamp the value returned by ldamax
to the dimension of the matrix before basing any decisions on it in dgetrf ?

xianyi · 2015-10-23T17:19:13Z

@martin-frbg , I agree with you about the temporary fix. I check the return of i?max.

xianyi added a commit that referenced this issue Oct 23, 2015

Refs #671. the return of i?max cannot larger than N.

5a29160

This was referenced Dec 24, 2015

segfault when inverting matrix containing nans #723

Closed

Incorrect Idamax with NaN values #624

Closed

xianyi closed this as completed Mar 1, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memory error in dswap_k/dgetf2_k #671

Memory error in dswap_k/dgetf2_k #671

kronbichler commented Oct 22, 2015

martin-frbg commented Oct 22, 2015

kronbichler commented Oct 22, 2015

kronbichler commented Oct 22, 2015

martin-frbg commented Oct 22, 2015

xianyi commented Oct 23, 2015

kronbichler commented Oct 23, 2015

martin-frbg commented Oct 23, 2015

xianyi commented Oct 23, 2015

Memory error in dswap_k/dgetf2_k #671

Memory error in dswap_k/dgetf2_k #671

Comments

kronbichler commented Oct 22, 2015

martin-frbg commented Oct 22, 2015

kronbichler commented Oct 22, 2015

kronbichler commented Oct 22, 2015

martin-frbg commented Oct 22, 2015

xianyi commented Oct 23, 2015

kronbichler commented Oct 23, 2015

martin-frbg commented Oct 23, 2015

xianyi commented Oct 23, 2015