GIMP hang / deadlock in get_memory_table / blas_thread_shutdown #1720

amurzeau · 2018-08-07T22:21:02Z

Hi,

There is a bug open in Debian related to gimp 2.10.2 and openblas 3.2: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=903514.

Depending on the machine and environment used, gimp can deadlock at startup because of a deadlock inside glibc.
I'm forwarding what I wrote for the Debian bug tracker:

Using gdb to find where it hung (gimp-gdb.txt) gives threads waiting on
a lock while doing thread-local related stuff and the main thread is in
the process of dl_close-ing openblas waiting the threads to exit using
pthread_join.

It seems that the lock used in tls_get_addr_tail [0] is the same as
the one locked by _dl_close [1].
A recursive lock is used but here it does not help as the thread calling
tls_get_addr_tail and _dl_close are not the same.

This deadlock may not happen everytime, in my case, the openblas threads
are still initializing while _dl_close is called.

Given this, I think the offending commit in openblas is bf40f80 [2]
which add TLS variables to avoid locking. But many change were done
since then.

One of related bug report is [3] which seems to indicate that the locks
handling is not easy inside glibc.

There were an attempt to fix deadlocks between tls_get_addr and a
_dl_close of a module whose finalizer joins with that thread [4].

So I see these possibles solutions:

Add a breaks between gimp and openblas
Disable TLS in openblas build (if possible, but this would cause a
performance loss for users that use openblas without gimp)
Patch glibc to not deadlock (but this seems not easy to do at all)

[0] https://github.com/bminor/glibc/blob/glibc-2.27/elf/dl-tls.c#L761
[1] https://github.com/bminor/glibc/blob/glibc-2.27/elf/dl-close.c#L812

[2]
bf40f80#diff-31f8d4e8863583d95bf2f9529f83844e
[4] https://sourceware.org/ml/libc-alpha/2015-06/msg00062.html

The text was updated successfully, but these errors were encountered:

amurzeau · 2018-08-07T22:28:21Z

As I see the implementation of glibc, it seems impossible to guarantee that no deadlock will happen when dl_closing openblas.

I don't think disabling TLS in openblas is a good solution as it would cause a performance hit even for users that don't use gimp.

Have you encountered issues with TLS before ?
I don't know well openblas, but maybe there a way to ensure the threads are running and were all initialized (ie: they are waiting in their while loop in blas_thread_server) before dl_close is called by the user ?

brada4 · 2018-08-08T05:03:20Z

Can you try building openblas with openmp? That should avoid present problem code.
There should be not that much locking in certain shutdown.

martin-frbg · 2018-08-08T06:04:36Z

How is Gimp built on Debian - does it use OpenMP by any chance ? A similar lockup was noted with the Arch Linux build of Gimp when using an OpenBLAS built without OpenMP support in conjuction with their OpenMP-enabled package of Gimp #240 (comment)

brada4 · 2018-08-08T06:50:34Z

debian default is without omp, ubuntu with.
They split build into blas and lapack libs

MAKE_OPTIONS := NO_LAPACKE=1 NO_AFFINITY=1 USE_OPENMP=0 NO_WARMUP=1 CFLAGS="$(CPPFLAGS) $(CFLAGS)" FFLAGS="$(FFLAGS)" # Avoid having -O2 automatically added (especially for DEB_BUILD_OPTIONS=noopt) MAKE_OPTIONS += COMMON_OPT= FCOMMON_OPT=-frecursive

oon3m0oo · 2018-08-08T08:16:00Z

Would you be willing to try stripping out the logic in memory.c which sets HAS_COMPILER_TLS? I wonder if you'd have better luck with pthreads' TLS.

amurzeau · 2018-08-08T10:52:25Z

Thanks for your pointers :-), I will try these. -- Alexis Murzeau

amurzeau · 2018-08-08T21:53:31Z

I tried to compile OpenBLAS without HAS_COMPILER_TLS (with a #undef) and gimp seems to work fine over a couple of runs (10~).
No deadlock at all while with the previous binaries (with HAS_COMPILER_TLS enabled), gimp would deadlock > 50% of the time.
So pthreads' TLS seems better.

I did not try to compile OpenBLAS with open MP enabled. But I'm not sure this will change something as there is no #ifdef in memory.c about open MP (all #ifdef were replaced with USE_OPENMP_UNUSED in commit b14f44d).

I'm not sure of the implications of removing HAS_COMPILER_TLS, I would think that this is Ok to use pthreads which should have similar performance than compiler's builtins while being explicitly written in the source code.
Would this be a solution if it is confirmed that this fixes the deadlock ?

amurzeau · 2018-08-08T22:24:12Z

For the record, the associated Debian bug is here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=903514.

I've proposed a patch removing HAS_COMPILER_TLS to test it on others machine.

brada4 · 2018-08-09T05:28:17Z

Can you check release before this pr?
#1625

amurzeau · 2018-08-09T20:24:52Z

I've tried openblas 3.2 + change memory.c contents to its content at commit f66b9c8 (this is the parent commit of the PR #1625).
No build error, and no gimp deadlock.
But I guess that this way, the performance won't be as good as just using pthread's TLS.

oon3m0oo · 2018-08-10T08:14:50Z

@amurzeau can you try out #1726? You'll just need to define USE_COMPILER_TLS to 0 in your build; that PR allows you to override the flag.

oon3m0oo · 2018-08-10T12:57:26Z

Another way to resolve this would be to call dlclose() on the right thread, which I believe is controlled by gimp. I'm also curious why gimp would be dlclosing openblas while threads are still running.

martin-frbg · 2018-08-10T13:10:46Z

Could be that some component of gimp is just trying to dlopen() any (optional ?) dependency on startup to see if it will be actually available when needed ? I do not think we can fix pre-existing behaviour in every caller out there, in particular if it is something that used to work before a certain change.

brada4 · 2018-08-10T19:05:36Z

Also would be nice to check /proc/pid/maps for other blas implementation statically linked or dynamically loaded by some plugin loaded before openblas. lib(t)atlas.so being one suspect here, or mkl.

amurzeau · 2018-08-10T20:13:49Z

Hi,

Actually, openblas is loaded and then unloaded by GEGL as part of loading and unloading /usr/lib/x86_64-linux-gnu/gegl-0.4/matting-levin.so: https://github.com/GNOME/gegl/blob/758b21f68438c53496078fb5cf177166f32e603a/gegl/module/geglmodule.c#L205

matting-levin.so is linked with openblas.

I do not have libaltlas or mkl installed.

$ cat /proc/2187/maps | grep -i -e blas -e atlas -e mkl | cut -d' ' -f25 | uniq
/usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.2.so
/usr/lib/x86_64-linux-gnu/openblas/liblapack.so.3
/usr/lib/x86_64-linux-gnu/openblas/libblas.so.3

while the PID being gimp.

As GEGL is loading then unloading the library, this cause cause dl_close to be called just after dl_open was called, this is why the openblas threads are not fully initialized at the time of dl_close.

I will try #1726.

amurzeau · 2018-08-10T22:12:38Z

I tried the PR and it works fine with a modification of the build system. See my comment on the PR.

brada4 · 2018-08-11T13:29:58Z

It is better to link to -lblas -llapack alternatives that one can choose openblas or not at runtime...
Though this does not justify crashes....

amurzeau · 2018-08-11T14:09:51Z

It is just a /proc/XXX/maps dump, I don't think it is directly linked to openblas.
The dependency chain is actually:

gimp needs libgegl-0.4.so.0
libgegl-0.4.so.0 dlopen gegl-0.4/matting-levin.so
gegl-0.4/matting-levin.so needs libumfpack.so.5
libumfpack.so.5 needs liblapack.so.3 and libblas.so.3
liblapack.so.3 and libblas.so.3 need libopenblas.so.0

While liblapack.so.3 and libblas.so.3 being the openblas implementation.

About the crash, I believe there is no easy way to fix this unless not using compiler builtin TLS (when it uses glibc implementation). See also https://sourceware.org/ml/libc-alpha/2015-06/msg00062.html for a attempt to fix some of the deadlocks in glibc.

The proposed PR is IMO a good way to fix this with small performance hit (if any) by using pthreads TLS instead and it keeps the amount of modified lines rather low. This way, suggestions to fix the gimp deadlock won't be "remove openblas" anymore and both will work together again.
I don't think glibc will be fixed any time soon given the dates of attempts to fix it.

martin-frbg · 2018-08-11T15:08:57Z

I'm trying to run a slightly modified version of your test on travis now to see how the various libcs behave with compiler TLS.

amurzeau · 2018-08-11T15:23:40Z

Actually, the test program will hang if the test fail (deadlocking). A timeout is needed to handle the failure without hanging the whole build.

Maybe:

in main, spawn a thread that does X tries (dlopen+dlclose) and then exit the thread
in main, join that thread with a timeout of Y seconds. If this timeout, the test fails else it succeed

martin-frbg · 2018-08-11T15:41:25Z

As this is intended as a one-time check I have taken the lazy approach of just letting failed jobs hit their time limit. (CI has a fixed limit of one hour for opensource projects, we used to hit that regularly before the default DYNAMIC_ARCH list was reduced.)

brada4 · 2018-08-11T15:48:45Z

@amurzeau ok, packaging is correct.

martin-frbg · 2018-08-11T16:56:23Z

CI does not seem to build the .so by default though, foiling my test...

martin-frbg · 2018-08-12T15:15:14Z

Actually the test did get executed, I was only confused by inconsistent CI output where it turned out the test was still running. My conclusion is that neither musl-libc nor OSX are affected by the problem (and neither is Android, according to a separate test on local hardware). So what I would like to do is
disable compiler TLS on systems that use glibc, which can be achieved by adding

#if defined(__GLIBC_PREREQ)
#undef HAS_COMPILER_TLS
#endif

(independent of a decsion to make this user-definable as proposed in #1726 - IMHO we need a default fix that keeps users from running into this issue before they even know it exists.)

amurzeau · 2018-08-15T16:02:31Z

Hi,

Here is a test with a timeout that allows checking the result of the test even in case of a failure:
https://gist.github.com/amurzeau/dda0d50b76f3752758a12274a4c7ffe5
Maybe a bit cleaner than just waiting an hour for the build to fail ^^.

About autodetecting GLIBC, isn't just using pthread's TLS every-time instead of trying to use compiler's TLS when available and not buggy, also a solution ?
That would simplify #ifdef on code that use TLS and make the behavior more uniform across platforms / targets.
But I'm not sure on the performance implications. It seems to be rather low as only get_memory_table seems to use TLS, and it is called in blas_memory_alloc and blas_memory_free.

martin-frbg · 2018-08-15T17:25:39Z

Thanks for the test, I'll see if this can get added to the regression tests in utest. Without knowing the platform implications and relative performance of both, I do not think it is desirable to disable compiler TLS everywhere - as my dirty little test showed, the problem appears to be limited to glibc so it should be sufficient to avoid it there.

martin-frbg · 2018-08-31T06:24:01Z

As the latest PR#1739 introduced new issues and no short-term solution appears to be forthcoming, I have made the original memory allocation code from before 0.3.1 the default again in 0.3.3 . If you know or assume that your code is not affected by the remaining problems of the new TLS allocator, and your GLIBC version is at least 2.20 you can compile OpenBLAS with the new option USE_TLS
to get the new code (based on #1739).

amurzeau · 2018-09-04T22:27:18Z

Hi,

I've tried the patch (PR: #1742) against https://gist.github.com/amurzeau/dda0d50b76f3752758a12274a4c7ffe5 and found that it crash when the thread that does dl_open exits.
The crash has the same cause as https://bugzilla.redhat.com/show_bug.cgi?id=1065695.
The issue is that when doing dl_close, pthread_key_destroy is not called on local_storage_key which cause a TLS leak I guess. When the thread doing dl_open / dl_close exits, it try to close leaked TLS but the library is unloaded already (via dl_close) and this cause a segfault.

The stacktrace is:

Thread 2 (Thread 0x7ffff7db7700 (LWP 18976)):
#0  0x00007ffff6d9cf80 in ?? ()
#1  0x00007ffff7f7f058 in __nptl_deallocate_tsd () at pthread_create.c:300
#2  0x00007ffff7f8007c in __nptl_deallocate_tsd () at ../sysdeps/nptl/futex-internal.h:200
#3  start_thread (arg=0x7ffff7db7700) at pthread_create.c:473
#4  0x00007ffff7eb2edf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7ffff7db8740 (LWP 18975)):
#0  0x00007ffff7f8138d in __GI___pthread_timedjoin_ex (threadid=140737351743232, thread_return=0x7fffffffdf98, abstime=0x0, block=<optimized out>) at pthread_join_common.c:89
#1  0x00005555555553b9 in main (argc=2, argv=0x7fffffffe0e8) at test_openblas.c:68

(Crashed in Thread 2)

I think this is related to this (from the pthread_key_create manpage):

An optional destructor function may be associated with each key value. At thread exit, if a key value has a non-NULL destructor pointer, and the thread has a non-NULL value associated with that key, the value of the key is set to NULL, and then the function pointed to is called with the previously associated value as its sole argument. The order of destructor calls is unspecified if more than one destructor exists for a thread when it exits.

pthread_key_delete need to be called somewhere to fix this, as done in CUPS. But it won't call the defined destructor when creating the key.
See a patch here: https://bugzilla.redhat.com/attachment.cgi?id=891985&action=diff

I suggest calling pthread_key_delete inside blas_memory_cleanup but I've not tried this as of now.
I wonder why I don't get any crash with the patch that just disable USE_COMPILER_TLS (https://salsa.debian.org/science-team/openblas/merge_requests/1/diffs).
TLS stuff in glibc is really not that stable :(

martin-frbg · 2018-09-05T07:17:40Z

Sorry, which patch did you test there ?

amurzeau · 2018-09-05T10:46:55Z

I've tested that PR: #1742 on version 3.2. I've included all commits in this PR.

amurzeau · 2018-09-05T10:48:39Z

By version 3.2, I mean the patch backported to version 3.2.

oon3m0oo mentioned this issue Aug 10, 2018

Allow overriding USE_COMPILER_TLS (formerly HAS_COMPILER_TLS). #1726

Open

martin-frbg added this to the 0.3.3 milestone Aug 12, 2018

oon3m0oo mentioned this issue Aug 16, 2018

Support arbitrary numbers of threads for memory allocation. #1739

Open

martin-frbg modified the milestones: 0.3.3, 0.3.4 Aug 31, 2018

martin-frbg modified the milestones: 0.3.4, 0.3.5 Dec 2, 2018

martin-frbg mentioned this issue Dec 19, 2018

Segfault when building haskelll-opencv with openblas >= 0.3.3 #1923

Closed

martin-frbg mentioned this issue Dec 29, 2018

Delete the pthread key on cleanup in TLS mode #1942

Merged

martin-frbg closed this as completed in #1942 Dec 30, 2018

martin-frbg mentioned this issue Jan 8, 2019

0.3.5 regression #1954

Closed

martin-frbg mentioned this issue Dec 31, 2019

win32: Don't rely on timeout at all #2350

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GIMP hang / deadlock in get_memory_table / blas_thread_shutdown #1720

GIMP hang / deadlock in get_memory_table / blas_thread_shutdown #1720

amurzeau commented Aug 7, 2018 •

edited

Loading

amurzeau commented Aug 7, 2018

brada4 commented Aug 8, 2018

martin-frbg commented Aug 8, 2018

brada4 commented Aug 8, 2018

oon3m0oo commented Aug 8, 2018

amurzeau commented Aug 8, 2018 via email

amurzeau commented Aug 8, 2018

amurzeau commented Aug 8, 2018

brada4 commented Aug 9, 2018

amurzeau commented Aug 9, 2018

oon3m0oo commented Aug 10, 2018

oon3m0oo commented Aug 10, 2018

martin-frbg commented Aug 10, 2018

brada4 commented Aug 10, 2018

amurzeau commented Aug 10, 2018

amurzeau commented Aug 10, 2018

brada4 commented Aug 11, 2018

amurzeau commented Aug 11, 2018

martin-frbg commented Aug 11, 2018

amurzeau commented Aug 11, 2018

martin-frbg commented Aug 11, 2018 •

edited

Loading

brada4 commented Aug 11, 2018

martin-frbg commented Aug 11, 2018

martin-frbg commented Aug 12, 2018

amurzeau commented Aug 15, 2018 •

edited

Loading

martin-frbg commented Aug 15, 2018

martin-frbg commented Aug 31, 2018

amurzeau commented Sep 4, 2018 •

edited

Loading

martin-frbg commented Sep 5, 2018

amurzeau commented Sep 5, 2018 via email

amurzeau commented Sep 5, 2018 via email

GIMP hang / deadlock in get_memory_table / blas_thread_shutdown #1720

GIMP hang / deadlock in get_memory_table / blas_thread_shutdown #1720

Comments

amurzeau commented Aug 7, 2018 • edited Loading

amurzeau commented Aug 7, 2018

brada4 commented Aug 8, 2018

martin-frbg commented Aug 8, 2018

brada4 commented Aug 8, 2018

oon3m0oo commented Aug 8, 2018

amurzeau commented Aug 8, 2018 via email

amurzeau commented Aug 8, 2018

amurzeau commented Aug 8, 2018

brada4 commented Aug 9, 2018

amurzeau commented Aug 9, 2018

oon3m0oo commented Aug 10, 2018

oon3m0oo commented Aug 10, 2018

martin-frbg commented Aug 10, 2018

brada4 commented Aug 10, 2018

amurzeau commented Aug 10, 2018

amurzeau commented Aug 10, 2018

brada4 commented Aug 11, 2018

amurzeau commented Aug 11, 2018

martin-frbg commented Aug 11, 2018

amurzeau commented Aug 11, 2018

martin-frbg commented Aug 11, 2018 • edited Loading

brada4 commented Aug 11, 2018

martin-frbg commented Aug 11, 2018

martin-frbg commented Aug 12, 2018

amurzeau commented Aug 15, 2018 • edited Loading

martin-frbg commented Aug 15, 2018

martin-frbg commented Aug 31, 2018

amurzeau commented Sep 4, 2018 • edited Loading

martin-frbg commented Sep 5, 2018

amurzeau commented Sep 5, 2018 via email

amurzeau commented Sep 5, 2018 via email

amurzeau commented Aug 7, 2018 •

edited

Loading

martin-frbg commented Aug 11, 2018 •

edited

Loading

amurzeau commented Aug 15, 2018 •

edited

Loading

amurzeau commented Sep 4, 2018 •

edited

Loading