Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GIMP hang / deadlock in get_memory_table / blas_thread_shutdown #1720

Closed
amurzeau opened this issue Aug 7, 2018 · 31 comments · Fixed by #1942
Closed

GIMP hang / deadlock in get_memory_table / blas_thread_shutdown #1720

amurzeau opened this issue Aug 7, 2018 · 31 comments · Fixed by #1942
Milestone

Comments

@amurzeau
Copy link

amurzeau commented Aug 7, 2018

Hi,

There is a bug open in Debian related to gimp 2.10.2 and openblas 3.2: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=903514.

Depending on the machine and environment used, gimp can deadlock at startup because of a deadlock inside glibc.
I'm forwarding what I wrote for the Debian bug tracker:

Using gdb to find where it hung (gimp-gdb.txt) gives threads waiting on
a lock while doing thread-local related stuff and the main thread is in
the process of dl_close-ing openblas waiting the threads to exit using
pthread_join.

It seems that the lock used in tls_get_addr_tail [0] is the same as
the one locked by _dl_close [1].
A recursive lock is used but here it does not help as the thread calling
tls_get_addr_tail and _dl_close are not the same.

This deadlock may not happen everytime, in my case, the openblas threads
are still initializing while _dl_close is called.

Given this, I think the offending commit in openblas is bf40f80 [2]
which add TLS variables to avoid locking. But many change were done
since then.

One of related bug report is [3] which seems to indicate that the locks
handling is not easy inside glibc.

There were an attempt to fix deadlocks between tls_get_addr and a
_dl_close of a module whose finalizer joins with that thread [4].

So I see these possibles solutions:

  • Add a breaks between gimp and openblas
  • Disable TLS in openblas build (if possible, but this would cause a
    performance loss for users that use openblas without gimp)
  • Patch glibc to not deadlock (but this seems not easy to do at all)

[0] https://github.com/bminor/glibc/blob/glibc-2.27/elf/dl-tls.c#L761
[1] https://github.com/bminor/glibc/blob/glibc-2.27/elf/dl-close.c#L812

[2]
bf40f80#diff-31f8d4e8863583d95bf2f9529f83844e
[4] https://sourceware.org/ml/libc-alpha/2015-06/msg00062.html

@amurzeau
Copy link
Author

amurzeau commented Aug 7, 2018

As I see the implementation of glibc, it seems impossible to guarantee that no deadlock will happen when dl_closing openblas.

I don't think disabling TLS in openblas is a good solution as it would cause a performance hit even for users that don't use gimp.

Have you encountered issues with TLS before ?
I don't know well openblas, but maybe there a way to ensure the threads are running and were all initialized (ie: they are waiting in their while loop in blas_thread_server) before dl_close is called by the user ?

@brada4
Copy link
Contributor

brada4 commented Aug 8, 2018

Can you try building openblas with openmp? That should avoid present problem code.
There should be not that much locking in certain shutdown.

@martin-frbg
Copy link
Collaborator

How is Gimp built on Debian - does it use OpenMP by any chance ? A similar lockup was noted with the Arch Linux build of Gimp when using an OpenBLAS built without OpenMP support in conjuction with their OpenMP-enabled package of Gimp #240 (comment)

@brada4
Copy link
Contributor

brada4 commented Aug 8, 2018

debian default is without omp, ubuntu with.
They split build into blas and lapack libs

MAKE_OPTIONS := NO_LAPACKE=1 NO_AFFINITY=1 USE_OPENMP=0 NO_WARMUP=1 CFLAGS="$(CPPFLAGS) $(CFLAGS)" FFLAGS="$(FFLAGS)" # Avoid having -O2 automatically added (especially for DEB_BUILD_OPTIONS=noopt) MAKE_OPTIONS += COMMON_OPT= FCOMMON_OPT=-frecursive

@oon3m0oo
Copy link
Contributor

oon3m0oo commented Aug 8, 2018

Would you be willing to try stripping out the logic in memory.c which sets HAS_COMPILER_TLS? I wonder if you'd have better luck with pthreads' TLS.

@amurzeau
Copy link
Author

amurzeau commented Aug 8, 2018 via email

@amurzeau
Copy link
Author

amurzeau commented Aug 8, 2018

I tried to compile OpenBLAS without HAS_COMPILER_TLS (with a #undef) and gimp seems to work fine over a couple of runs (10~).
No deadlock at all while with the previous binaries (with HAS_COMPILER_TLS enabled), gimp would deadlock > 50% of the time.
So pthreads' TLS seems better.

I did not try to compile OpenBLAS with open MP enabled. But I'm not sure this will change something as there is no #ifdef in memory.c about open MP (all #ifdef were replaced with USE_OPENMP_UNUSED in commit b14f44d).

I'm not sure of the implications of removing HAS_COMPILER_TLS, I would think that this is Ok to use pthreads which should have similar performance than compiler's builtins while being explicitly written in the source code.
Would this be a solution if it is confirmed that this fixes the deadlock ?

@amurzeau
Copy link
Author

amurzeau commented Aug 8, 2018

For the record, the associated Debian bug is here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=903514.

I've proposed a patch removing HAS_COMPILER_TLS to test it on others machine.

@brada4
Copy link
Contributor

brada4 commented Aug 9, 2018

Can you check release before this pr?
#1625

@amurzeau
Copy link
Author

amurzeau commented Aug 9, 2018

I've tried openblas 3.2 + change memory.c contents to its content at commit f66b9c8 (this is the parent commit of the PR #1625).
No build error, and no gimp deadlock.
But I guess that this way, the performance won't be as good as just using pthread's TLS.

@oon3m0oo
Copy link
Contributor

@amurzeau can you try out #1726? You'll just need to define USE_COMPILER_TLS to 0 in your build; that PR allows you to override the flag.

@oon3m0oo
Copy link
Contributor

Another way to resolve this would be to call dlclose() on the right thread, which I believe is controlled by gimp. I'm also curious why gimp would be dlclosing openblas while threads are still running.

@martin-frbg
Copy link
Collaborator

Could be that some component of gimp is just trying to dlopen() any (optional ?) dependency on startup to see if it will be actually available when needed ? I do not think we can fix pre-existing behaviour in every caller out there, in particular if it is something that used to work before a certain change.

@brada4
Copy link
Contributor

brada4 commented Aug 10, 2018

Also would be nice to check /proc/pid/maps for other blas implementation statically linked or dynamically loaded by some plugin loaded before openblas. lib(t)atlas.so being one suspect here, or mkl.

@amurzeau
Copy link
Author

Hi,

Actually, openblas is loaded and then unloaded by GEGL as part of loading and unloading /usr/lib/x86_64-linux-gnu/gegl-0.4/matting-levin.so: https://github.com/GNOME/gegl/blob/758b21f68438c53496078fb5cf177166f32e603a/gegl/module/geglmodule.c#L205

matting-levin.so is linked with openblas.

I do not have libaltlas or mkl installed.

$ cat /proc/2187/maps | grep -i -e blas -e atlas -e mkl | cut -d' ' -f25 | uniq
/usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.2.so
/usr/lib/x86_64-linux-gnu/openblas/liblapack.so.3
/usr/lib/x86_64-linux-gnu/openblas/libblas.so.3

while the PID being gimp.

As GEGL is loading then unloading the library, this cause cause dl_close to be called just after dl_open was called, this is why the openblas threads are not fully initialized at the time of dl_close.

I will try #1726.

@amurzeau
Copy link
Author

I tried the PR and it works fine with a modification of the build system. See my comment on the PR.

@brada4
Copy link
Contributor

brada4 commented Aug 11, 2018

It is better to link to -lblas -llapack alternatives that one can choose openblas or not at runtime...
Though this does not justify crashes....

@amurzeau
Copy link
Author

It is just a /proc/XXX/maps dump, I don't think it is directly linked to openblas.
The dependency chain is actually:

  • gimp needs libgegl-0.4.so.0
  • libgegl-0.4.so.0 dlopen gegl-0.4/matting-levin.so
  • gegl-0.4/matting-levin.so needs libumfpack.so.5
  • libumfpack.so.5 needs liblapack.so.3 and libblas.so.3
  • liblapack.so.3 and libblas.so.3 need libopenblas.so.0

While liblapack.so.3 and libblas.so.3 being the openblas implementation.

About the crash, I believe there is no easy way to fix this unless not using compiler builtin TLS (when it uses glibc implementation). See also https://sourceware.org/ml/libc-alpha/2015-06/msg00062.html for a attempt to fix some of the deadlocks in glibc.

The proposed PR is IMO a good way to fix this with small performance hit (if any) by using pthreads TLS instead and it keeps the amount of modified lines rather low. This way, suggestions to fix the gimp deadlock won't be "remove openblas" anymore and both will work together again.
I don't think glibc will be fixed any time soon given the dates of attempts to fix it.

@martin-frbg
Copy link
Collaborator

I'm trying to run a slightly modified version of your test on travis now to see how the various libcs behave with compiler TLS.

@amurzeau
Copy link
Author

Actually, the test program will hang if the test fail (deadlocking). A timeout is needed to handle the failure without hanging the whole build.

Maybe:

  • in main, spawn a thread that does X tries (dlopen+dlclose) and then exit the thread
  • in main, join that thread with a timeout of Y seconds. If this timeout, the test fails else it succeed

@martin-frbg
Copy link
Collaborator

martin-frbg commented Aug 11, 2018

As this is intended as a one-time check I have taken the lazy approach of just letting failed jobs hit their time limit. (CI has a fixed limit of one hour for opensource projects, we used to hit that regularly before the default DYNAMIC_ARCH list was reduced.)

@brada4
Copy link
Contributor

brada4 commented Aug 11, 2018

@amurzeau ok, packaging is correct.

@martin-frbg
Copy link
Collaborator

CI does not seem to build the .so by default though, foiling my test...

@martin-frbg
Copy link
Collaborator

Actually the test did get executed, I was only confused by inconsistent CI output where it turned out the test was still running. My conclusion is that neither musl-libc nor OSX are affected by the problem (and neither is Android, according to a separate test on local hardware). So what I would like to do is
disable compiler TLS on systems that use glibc, which can be achieved by adding

#if defined(__GLIBC_PREREQ)
#undef HAS_COMPILER_TLS
#endif

(independent of a decsion to make this user-definable as proposed in #1726 - IMHO we need a default fix that keeps users from running into this issue before they even know it exists.)

@martin-frbg martin-frbg added this to the 0.3.3 milestone Aug 12, 2018
@amurzeau
Copy link
Author

amurzeau commented Aug 15, 2018

Hi,

Here is a test with a timeout that allows checking the result of the test even in case of a failure:
https://gist.github.com/amurzeau/dda0d50b76f3752758a12274a4c7ffe5
Maybe a bit cleaner than just waiting an hour for the build to fail ^^.

About autodetecting GLIBC, isn't just using pthread's TLS every-time instead of trying to use compiler's TLS when available and not buggy, also a solution ?
That would simplify #ifdef on code that use TLS and make the behavior more uniform across platforms / targets.
But I'm not sure on the performance implications. It seems to be rather low as only get_memory_table seems to use TLS, and it is called in blas_memory_alloc and blas_memory_free.

@martin-frbg
Copy link
Collaborator

Thanks for the test, I'll see if this can get added to the regression tests in utest. Without knowing the platform implications and relative performance of both, I do not think it is desirable to disable compiler TLS everywhere - as my dirty little test showed, the problem appears to be limited to glibc so it should be sufficient to avoid it there.

@martin-frbg
Copy link
Collaborator

As the latest PR#1739 introduced new issues and no short-term solution appears to be forthcoming, I have made the original memory allocation code from before 0.3.1 the default again in 0.3.3 . If you know or assume that your code is not affected by the remaining problems of the new TLS allocator, and your GLIBC version is at least 2.20 you can compile OpenBLAS with the new option USE_TLS
to get the new code (based on #1739).

@martin-frbg martin-frbg modified the milestones: 0.3.3, 0.3.4 Aug 31, 2018
@amurzeau
Copy link
Author

amurzeau commented Sep 4, 2018

Hi,

I've tried the patch (PR: #1742) against https://gist.github.com/amurzeau/dda0d50b76f3752758a12274a4c7ffe5 and found that it crash when the thread that does dl_open exits.
The crash has the same cause as https://bugzilla.redhat.com/show_bug.cgi?id=1065695.
The issue is that when doing dl_close, pthread_key_destroy is not called on local_storage_key which cause a TLS leak I guess. When the thread doing dl_open / dl_close exits, it try to close leaked TLS but the library is unloaded already (via dl_close) and this cause a segfault.

The stacktrace is:

Thread 2 (Thread 0x7ffff7db7700 (LWP 18976)):
#0  0x00007ffff6d9cf80 in ?? ()
#1  0x00007ffff7f7f058 in __nptl_deallocate_tsd () at pthread_create.c:300
#2  0x00007ffff7f8007c in __nptl_deallocate_tsd () at ../sysdeps/nptl/futex-internal.h:200
#3  start_thread (arg=0x7ffff7db7700) at pthread_create.c:473
#4  0x00007ffff7eb2edf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7ffff7db8740 (LWP 18975)):
#0  0x00007ffff7f8138d in __GI___pthread_timedjoin_ex (threadid=140737351743232, thread_return=0x7fffffffdf98, abstime=0x0, block=<optimized out>) at pthread_join_common.c:89
#1  0x00005555555553b9 in main (argc=2, argv=0x7fffffffe0e8) at test_openblas.c:68

(Crashed in Thread 2)

I think this is related to this (from the pthread_key_create manpage):

An optional destructor function may be associated with each key value. At thread exit, if a key value has a non-NULL destructor pointer, and the thread has a non-NULL value associated with that key, the value of the key is set to NULL, and then the function pointed to is called with the previously associated value as its sole argument. The order of destructor calls is unspecified if more than one destructor exists for a thread when it exits.

pthread_key_delete need to be called somewhere to fix this, as done in CUPS. But it won't call the defined destructor when creating the key.
See a patch here: https://bugzilla.redhat.com/attachment.cgi?id=891985&action=diff

I suggest calling pthread_key_delete inside blas_memory_cleanup but I've not tried this as of now.
I wonder why I don't get any crash with the patch that just disable USE_COMPILER_TLS (https://salsa.debian.org/science-team/openblas/merge_requests/1/diffs).
TLS stuff in glibc is really not that stable :(

@martin-frbg
Copy link
Collaborator

Sorry, which patch did you test there ?

@amurzeau
Copy link
Author

amurzeau commented Sep 5, 2018 via email

@amurzeau
Copy link
Author

amurzeau commented Sep 5, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants