GIMP hang / deadlock in get_memory_table / blas_thread_shutdown #1720
From what I can see of the glibc implementation, it seems impossible to guarantee that no deadlock will happen when dl_closing openblas. I don't think disabling TLS in openblas is a good solution, as it would cause a performance hit even for users that don't use gimp. Have you encountered issues with TLS before? |
Can you try building openblas with OpenMP? That should avoid the problematic code path. |
How is Gimp built on Debian - does it use OpenMP by any chance? A similar lockup was noted with the Arch Linux build of Gimp when using an OpenBLAS built without OpenMP support in conjunction with their OpenMP-enabled package of Gimp #240 (comment) |
debian default is without omp, ubuntu with.
|
Would you be willing to try stripping out the logic in memory.c which sets HAS_COMPILER_TLS? I wonder if you'd have better luck with pthreads' TLS. |
Thanks for your pointers :-), I will try these.
--
Alexis Murzeau
|
I tried to compile OpenBLAS without HAS_COMPILER_TLS (with a #undef) and gimp seems to work fine over a couple of runs (~10). I did not try to compile OpenBLAS with OpenMP enabled, but I'm not sure this would change anything, as there is no #ifdef in memory.c about OpenMP (all #ifdef were replaced with USE_OPENMP_UNUSED in commit b14f44d). I'm not sure of the implications of removing HAS_COMPILER_TLS; I would think that it is OK to use pthreads, which should have similar performance to the compiler's builtins while being explicitly written in the source code. |
For the record, the associated Debian bug is here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=903514. I've proposed a patch removing HAS_COMPILER_TLS to test it on other machines. |
Can you check the release before this PR? |
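For readers unfamiliar with the two mechanisms being compared here, the following is a minimal sketch and not OpenBLAS's memory.c. It contrasts a compiler TLS variable (`__thread`) with an explicit pthreads key. All names (`compiler_tls_slot`, `pthread_tls_slot`, `slots_are_per_thread`) are hypothetical and chosen only for illustration.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Compiler TLS: storage is managed by the toolchain and the C library.
   In a dlopen()ed library on glibc, accesses can go through the dynamic
   TLS lookup seen in the backtrace of this issue. */
static __thread int compiler_tls_slot;

int *compiler_tls_slot_addr(void) { return &compiler_tls_slot; }

/* pthreads TLS: an explicit key, created once per process. */
static pthread_key_t slot_key;
static pthread_once_t slot_once = PTHREAD_ONCE_INIT;

static void make_key(void) {
    /* free() runs as the destructor for a thread's non-NULL value at
       thread exit. Error handling is omitted in this sketch. */
    pthread_key_create(&slot_key, free);
}

/* Returns the calling thread's slot, allocating it on first use.
   Returns NULL if allocation or registration fails. */
int *pthread_tls_slot(void) {
    pthread_once(&slot_once, make_key);
    int *slot = pthread_getspecific(slot_key);
    if (slot == NULL) {
        slot = calloc(1, sizeof *slot);
        if (slot != NULL && pthread_setspecific(slot_key, slot) != 0) {
            free(slot);
            slot = NULL;
        }
    }
    return slot;
}

/* Worker: reports whether its own slot differs from the creator's slot. */
static void *worker(void *creator_slot) {
    int *mine = pthread_tls_slot();
    return (void *)(intptr_t)(mine != NULL && mine != creator_slot);
}

/* Returns 1 if a second thread gets a slot distinct from the caller's. */
int slots_are_per_thread(void) {
    pthread_t t;
    void *differs = NULL;
    int *mine = pthread_tls_slot();
    if (mine == NULL || pthread_create(&t, NULL, worker, mine) != 0)
        return 0;
    pthread_join(t, &differs);
    return (int)(intptr_t)differs;
}
```

With the pthreads variant, every access is an explicit library call, which is the trade-off discussed in this comment: slightly more visible overhead in exchange for not depending on the compiler's TLS model.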
Another way to resolve this would be to call dlclose() on the right thread, which I believe is controlled by gimp. I'm also curious why gimp would be dlclosing openblas while threads are still running. |
Could be that some component of gimp is just trying to dlopen() any (optional ?) dependency on startup to see if it will be actually available when needed ? I do not think we can fix pre-existing behaviour in every caller out there, in particular if it is something that used to work before a certain change. |
Also would be nice to check /proc/pid/maps for other blas implementation statically linked or dynamically loaded by some plugin loaded before openblas. lib(t)atlas.so being one suspect here, or mkl. |
Hi, Actually, openblas is loaded and then unloaded by GEGL as part of loading and unloading /usr/lib/x86_64-linux-gnu/gegl-0.4/matting-levin.so: https://github.com/GNOME/gegl/blob/758b21f68438c53496078fb5cf177166f32e603a/gegl/module/geglmodule.c#L205 matting-levin.so is linked with openblas. I do not have libatlas or mkl installed.
while the PID being gimp. As GEGL loads and then immediately unloads the library, dl_close is called just after dl_open, which is why the openblas threads are not fully initialized at the time of dl_close. I will try #1726. |
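The load-then-unload pattern described in this comment can be sketched as follows. This is an illustration using plain dlopen()/dlclose(), not GEGL's actual module code; the function name `probe_plugin` is hypothetical.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Loads a shared library and unloads it again right away.
   Returns 1 if the library could be loaded, 0 otherwise. */
int probe_plugin(const char *path) {
    void *handle = dlopen(path, RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 0;
    }
    /* If a constructor in the library (or in one of its dependencies)
       started worker threads, dlclose() can run the matching finalizer
       while those threads are still starting up. */
    dlclose(handle);
    return 1;
}
```

A caller would pass the path of the plugin to check, for example the matting-levin.so path quoted above. On older glibc versions the program may need to be linked with -ldl.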
I tried the PR and it works fine with a modification of the build system. See my comment on the PR. |
It is better to link to the -lblas -llapack alternatives, so that one can choose at runtime whether to use openblas or not... |
It is just a /proc/XXX/maps dump, I don't think it is directly linked to openblas.
About the crash, I believe there is no easy way to fix this other than not using the compiler's builtin TLS (when it uses the glibc implementation). See also https://sourceware.org/ml/libc-alpha/2015-06/msg00062.html for an attempt to fix some of the deadlocks in glibc. The proposed PR is IMO a good way to fix this with a small performance hit (if any) by using pthreads TLS instead, and it keeps the number of modified lines rather low. This way, suggestions to fix the gimp deadlock won't be "remove openblas" anymore and both will work together again. |
I'm trying to run a slightly modified version of your test on travis now to see how the various libcs behave with compiler TLS. |
Actually, the test program will hang if the test fails (deadlocking). A timeout is needed to handle the failure without hanging the whole build. Maybe:
|
As this is intended as a one-time check I have taken the lazy approach of just letting failed jobs hit their time limit. (CI has a fixed limit of one hour for opensource projects, we used to hit that regularly before the default DYNAMIC_ARCH list was reduced.) |
@amurzeau ok, packaging is correct. |
CI does not seem to build the .so by default though, foiling my test... |
Actually the test did get executed, I was only confused by inconsistent CI output where it turned out the test was still running. My conclusion is that neither musl-libc nor OSX are affected by the problem (and neither is Android, according to a separate test on local hardware). So what I would like to do is
(independent of a decision to make this user-definable as proposed in #1726 - IMHO we need a default fix that keeps users from running into this issue before they even know it exists.) |
Hi, Here is a test with a timeout that allows checking the result of the test even in case of a failure: About autodetecting glibc, isn't simply using pthread's TLS every time, instead of trying to use the compiler's TLS when it is available and not buggy, also a solution? |
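The test referred to in this comment is not reproduced in the thread. As a rough sketch of the general idea only, a possibly deadlocking test can be run in a child process and killed after a deadline, so that a hang becomes a reported failure. The names `run_with_timeout`, `TEST_TIMED_OUT`, `passing_test` and `hanging_test` are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define TEST_TIMED_OUT (-2)

/* Runs test_fn in a child process. Returns the child's exit status,
   TEST_TIMED_OUT if it had to be killed after timeout_ms milliseconds,
   or -1 on any other error. */
int run_with_timeout(int (*test_fn)(void), int timeout_ms) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(test_fn());               /* child: run the test and exit */

    const struct timespec tick = {0, 10 * 1000 * 1000};   /* 10 ms */
    for (int waited = 0; waited < timeout_ms; waited += 10) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
        if (r < 0)
            return -1;
        nanosleep(&tick, NULL);         /* poll again after a short sleep */
    }
    kill(pid, SIGKILL);                 /* deadline passed: treat as a hang */
    waitpid(pid, NULL, 0);              /* reap the killed child */
    return TEST_TIMED_OUT;
}

/* Stand-ins for a real test: one returns at once, one never returns. */
int passing_test(void) { return 0; }
int hanging_test(void) { for (;;) pause(); }
```

For example, `run_with_timeout(hanging_test, 200)` returns `TEST_TIMED_OUT` after roughly 200 ms instead of blocking the build.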
Thanks for the test, I'll see if this can get added to the regression tests in utest. Without knowing the platform implications and relative performance of both, I do not think it is desirable to disable compiler TLS everywhere - as my dirty little test showed, the problem appears to be limited to glibc so it should be sufficient to avoid it there. |
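Limiting the workaround to glibc, as suggested here, could be expressed with a preprocessor check along the following lines. This is only a sketch: the macro `EXAMPLE_USE_COMPILER_TLS` and the function `example_uses_compiler_tls` are hypothetical and are not OpenBLAS's actual build logic.

```c
#include <limits.h>   /* on glibc, any libc header defines __GLIBC__ */

/* Avoid compiler TLS only where the problem was observed (glibc). */
#if defined(__GLIBC__)
#  define EXAMPLE_USE_COMPILER_TLS 0
#else
#  define EXAMPLE_USE_COMPILER_TLS 1
#endif

/* Returns 1 if this build would use compiler TLS, 0 if it would fall
   back to pthreads TLS. */
int example_uses_compiler_tls(void) {
    return EXAMPLE_USE_COMPILER_TLS;
}
```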
As the latest PR #1739 introduced new issues and no short-term solution appears to be forthcoming, I have made the original memory allocation code from before 0.3.1 the default again in 0.3.3. If you know or assume that your code is not affected by the remaining problems of the new TLS allocator, and your glibc version is at least 2.20, you can compile OpenBLAS with the new option USE_TLS |
Hi, I've tried the patch (PR: #1742) against https://gist.github.com/amurzeau/dda0d50b76f3752758a12274a4c7ffe5 and found that it crashes when the thread that does dl_open exits. The stacktrace is:
(Crashed in Thread 2) I think this is related to this (from the pthread_key_create manpage):
I suggest calling |
Sorry, which patch did you test there ? |
I've tested that PR: #1742 on version 3.2.
I've included all commits in this PR.
|
By version 3.2, I mean the patch backported to version 3.2.
|
Hi,
There is a bug open in Debian related to gimp 2.10.2 and openblas 3.2: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=903514.
Depending on the machine and environment used, gimp can deadlock at startup because of a deadlock inside glibc.
I'm forwarding what I wrote for the Debian bug tracker:
Using gdb to find where it hung (gimp-gdb.txt) shows threads waiting on
a lock while doing thread-local related work, while the main thread is in
the process of dl_close-ing openblas and waiting for the threads to exit using
pthread_join.
It seems that the lock used in tls_get_addr_tail [0] is the same as the one locked by _dl_close [1].
A recursive lock is used, but here it does not help because the threads calling tls_get_addr_tail and _dl_close are not the same.
This deadlock may not happen every time; in my case, the openblas threads
are still initializing when _dl_close is called.
Given this, I think the offending commit in openblas is bf40f80 [2],
which adds TLS variables to avoid locking. But many changes have been made
since then.
One related bug report is [3], which seems to indicate that lock
handling is not easy inside glibc.
There was an attempt to fix deadlocks between tls_get_addr and a _dl_close
of a module whose finalizer joins with that thread [4].
So I see these possible solutions:
performance loss for users that use openblas without gimp)
[0] https://github.com/bminor/glibc/blob/glibc-2.27/elf/dl-tls.c#L761
[1] https://github.com/bminor/glibc/blob/glibc-2.27/elf/dl-close.c#L812
[2] bf40f80#diff-31f8d4e8863583d95bf2f9529f83844e
[4] https://sourceware.org/ml/libc-alpha/2015-06/msg00062.html
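The point made in the report about the recursive lock can be illustrated with a small stand-alone sketch. It does not use glibc's loader lock; `load_lock` is a hypothetical stand-in, and the second thread uses trylock instead of a blocking lock so that the example terminates rather than deadlocking.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t load_lock;   /* stand-in for the loader lock */

/* Plays the role of the thread doing the TLS lookup. A blocking
   pthread_mutex_lock() here would never return, because the owner is
   waiting in pthread_join() for this thread. */
static void *worker(void *arg) {
    int *result = arg;
    *result = pthread_mutex_trylock(&load_lock);
    if (*result == 0)
        pthread_mutex_unlock(&load_lock);
    return NULL;
}

/* Returns 1 on success and stores the two lock results; 0 on setup error.
   owner_relock: result of the owning thread locking a second time.
   other_thread_lock: result of a different thread trying to lock. */
int recursive_lock_demo(int *owner_relock, int *other_thread_lock) {
    pthread_mutexattr_t attr;
    pthread_t t;

    if (pthread_mutexattr_init(&attr) != 0)
        return 0;
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&load_lock, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&load_lock);     /* the "dl_close" side takes the lock */

    /* Same thread again: recursion makes this succeed. */
    *owner_relock = pthread_mutex_trylock(&load_lock);
    if (*owner_relock == 0)
        pthread_mutex_unlock(&load_lock);

    if (pthread_create(&t, NULL, worker, other_thread_lock) != 0) {
        pthread_mutex_unlock(&load_lock);
        pthread_mutex_destroy(&load_lock);
        return 0;
    }
    pthread_join(t, NULL);              /* join while still holding the lock */

    pthread_mutex_unlock(&load_lock);
    pthread_mutex_destroy(&load_lock);
    return 1;
}
```

The owning thread can re-acquire the lock, but the other thread gets EBUSY: recursion only helps the thread that already holds the lock, which matches the situation described in the report.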