Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory error in dswap_k/dgetf2_k #671

Closed
kronbichler opened this issue Oct 22, 2015 · 8 comments
Closed

Memory error in dswap_k/dgetf2_k #671

kronbichler opened this issue Oct 22, 2015 · 8 comments

Comments

@kronbichler
Copy link

For a (admittedly corner case) simple 8x8 matrix inversion problem according to the code:

#include <limits>

extern "C"
{
  void dgetrf_ (const int *m, const int *n, double *A,
                const int *lda, int *ipiv, int *info);
  void dgetri_ (const int *n, double *A, const int *lda,
                int *ipiv, double *inv_work, const int *lwork, int *info);
}

int main()
{
  const int N = 8;
  const int lwork = 2*N;
  double *mat = new double[N*N];
  int *ipiv = new int[N];
  double *work = new double[lwork];
  int info = 0;
  for (int i=0; i<N; ++i)
    for (int j=0; j<N; ++j)
      mat[i*N+j] = -std::numeric_limits<double>::quiet_NaN();

  dgetrf_ (&N, &N, mat, &N, ipiv, &info);
  dgetri_ (&N, mat, &N, ipiv, work, &lwork, &info);

  return info;
}

I get memory access errors in both the factorization phase and the inversion phase:

mklap4:openblas_bug$ g++ -L/home/kronbichler/sw/lib/ -lopenblas -Wl,-rpath=/home/kronbichler/sw/lib/ -lopenblas test.cc 
mklap4:openblas_bug$ valgrind ./a.out 
==8649== Memcheck, a memory error detector
==8649== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==8649== Using Valgrind-3.11.0.SVN and LibVEX; rerun with -h for copyright info
==8649== Command: ./a.out
==8649== 
==8649== Invalid read of size 8
==8649==    at 0x4D0DF3A: dgetf2_k (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4D0CD75: dgetrf_single (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4AAEA64: dgetrf_ (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4008D2: main (in /home/kronbichler/Work/deal_tests/trilinos_tests/openblas_bug/a.out)
==8649==  Address 0x67a61e0 is 0 bytes after a block of size 512 alloc'd
==8649==    at 0x402D81C: operator new[](unsigned long) (vg_replace_malloc.c:422)
==8649==    by 0x400825: main (in /home/kronbichler/Work/deal_tests/trilinos_tests/openblas_bug/a.out)
==8649== 
==8649== Invalid write of size 8
==8649==    at 0x4D0DF44: dgetf2_k (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4D0CD75: dgetrf_single (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4AAEA64: dgetrf_ (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4008D2: main (in /home/kronbichler/Work/deal_tests/trilinos_tests/openblas_bug/a.out)
==8649==  Address 0x67a61e0 is 0 bytes after a block of size 512 alloc'd
==8649==    at 0x402D81C: operator new[](unsigned long) (vg_replace_malloc.c:422)
==8649==    by 0x400825: main (in /home/kronbichler/Work/deal_tests/trilinos_tests/openblas_bug/a.out)
==8649== 
==8649== Invalid read of size 16
==8649==    at 0x4BEF93D: dswap_k (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4AA4B86: dswap_ (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4E56F20: dgetri_ (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4008FB: main (in /home/kronbichler/Work/deal_tests/trilinos_tests/openblas_bug/a.out)
==8649==  Address 0x67a61e0 is 0 bytes after a block of size 512 alloc'd
==8649==    at 0x402D81C: operator new[](unsigned long) (vg_replace_malloc.c:422)
==8649==    by 0x400825: main (in /home/kronbichler/Work/deal_tests/trilinos_tests/openblas_bug/a.out)
==8649== 
==8649== Invalid write of size 8
==8649==    at 0x4BEF942: dswap_k (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4AA4B86: dswap_ (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4E56F20: dgetri_ (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4008FB: main (in /home/kronbichler/Work/deal_tests/trilinos_tests/openblas_bug/a.out)
==8649==  Address 0x67a61e0 is 0 bytes after a block of size 512 alloc'd
==8649==    at 0x402D81C: operator new[](unsigned long) (vg_replace_malloc.c:422)
==8649==    by 0x400825: main (in /home/kronbichler/Work/deal_tests/trilinos_tests/openblas_bug/a.out)
==8649== 
==8649== Invalid read of size 16
==8649==    at 0x4BEF94F: dswap_k (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4AA4B86: dswap_ (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4E56F20: dgetri_ (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4008FB: main (in /home/kronbichler/Work/deal_tests/trilinos_tests/openblas_bug/a.out)
==8649==  Address 0x67a61f0 is 16 bytes after a block of size 512 alloc'd
==8649==    at 0x402D81C: operator new[](unsigned long) (vg_replace_malloc.c:422)
==8649==    by 0x400825: main (in /home/kronbichler/Work/deal_tests/trilinos_tests/openblas_bug/a.out)
==8649== 
==8649== Invalid write of size 8
==8649==    at 0x4BEF954: dswap_k (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4AA4B86: dswap_ (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4E56F20: dgetri_ (in /home/kronbichler/sw/lib/libopenblas_haswell-r0.2.14.so)
==8649==    by 0x4008FB: main (in /home/kronbichler/Work/deal_tests/trilinos_tests/openblas_bug/a.out)
==8649==  Address 0x67a61f0 is 16 bytes after a block of size 512 alloc'd
==8649==    at 0x402D81C: operator new[](unsigned long) (vg_replace_malloc.c:422)
==8649==    by 0x400825: main (in /home/kronbichler/Work/deal_tests/trilinos_tests/openblas_bug/a.out)
==8649== 

valgrind: m_mallocfree.c:303 (get_bszB_as_is): Assertion 'bszB_lo == bszB_hi' failed.
valgrind: Heap block lo/hi size mismatch: lo = 576, hi = 18444492273895866368.
This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata.  If you fix any
invalid writes reported by Memcheck, this assertion failure will
probably go away.  Please try that before reporting this as a bug.

The error seems to come from the dswap routines that do partial pivoting. The matrix does only contain NaN and inversion makes no sense, but OpenBLAS should not create memory access errors.

I compiled openBLAS from the latest git source but also checked release 0.2.14. Appears on both haswell compilation (see above) and penryn compilation. Compilers: gcc/gfortran 5.2, no other special options in openBLAS build process.

@martin-frbg
Copy link
Collaborator

Looks like some kind of calling convention problem between C and FORTRAN, does the code snippet in https://icl.cs.utk.edu/lapack-forum/viewtopic.php?f=2&t=23 work for you ?

@kronbichler
Copy link
Author

I don't think it's a calling convention problem because in that case the code should fail in any case. If I add sensible matrix data, e.g.,

      mat[i*N+j] = 1. + 0.2*rand()/RAND_MAX;

instead of std::numeric_limits<double>::quiet_NaN(), I do not see any memory errors. It really is due to the degenerate matrix in my opinion. Ah, one detail I forgot: The error is observed on x86_64 linux.

@kronbichler
Copy link
Author

Looking closer, the following pivots are computed by dgetrf:

9 2 3 4 5 6 7 8

The first index 9 is out of bound for a matrix of size 8. (So the downstream problems valgrind detects in dgetri might be due to that.) But there must be a bug in the OpenBLAS implementation of dgetrf_ to make it return this number.

@martin-frbg
Copy link
Collaborator

Alright, it is something about the handling of NaN values in input (or even a matrix composed exclusively of them) - the same code works fine even if I use infinity() for all the values. (And just for grins - I can make it work with NaNs by setting only mat[7] to a non-NaN value. Does not work with just any matrix element, perhaps related to how the task is divided internally ?)
Not sure if this is a bug or just a curiousity in "undefined behaviour" territory... wonder how the generic netlib dgetrf behaves ?

@xianyi
Copy link
Collaborator

xianyi commented Oct 23, 2015

@kronbichler , because of the NaNs, the idamax (used by dgetf2 to select the pivot) may return the wrong overflowed index.

Please look at #624

@kronbichler
Copy link
Author

@xianyi Thanks for info. This could very well be the cause. Unfortunately, it looks like the implementation is done kernel/x86_64/iamax_sse2.S and I don't speak assembler so I can't help :-(

@martin-frbg If I call the netlib lapack (linked with -llapack -lblas using the system libraries), I get the following ipiv array:

1 2 3 4 5 6 7 8

Similarly, MKL gives me

8 8 8 4 8 8 8 8

Regarding undefined behavior: I don't know the specifics of the LAPACK interface with NaN numbers, but I don't like to work around a memory bug in our code that's two layers away from the LAPACK/BLAS implementation (I call the functions in a big C++ project that encapsulates the Trilinos EPETRA_LAPACK methods which in turn provide access to OpenBLAS). On the other hand, I really like the OpenBLAS project since it provides very good performance, much better than all other open-source alternatives we've tried.

@martin-frbg
Copy link
Collaborator

I agree that it is a bug, I am arguing that it "only" has this drastic consequences when the input is dubious already. Perhaps a kludgy temporary solution could be to clamp the value returned by ldamax
to the dimension of the matrix before basing any decisions on it in dgetrf ?

@xianyi
Copy link
Collaborator

xianyi commented Oct 23, 2015

@martin-frbg , I agree with you about the temporary fix. I check the return of i?max.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants