Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

win32: Don't rely on timeout at all #2350

Open
wants to merge 2 commits into
base: develop
Choose a base branch
from

Conversation

Keno
Copy link
Contributor

@Keno Keno commented Dec 29, 2019

Same issue as #2314, but this time seen in Julia. Having a timeout here is fundamentally unsound. There is no guarantee that the kernel will ever schedule the thread holding the lock. We need to rely on the pool threads to exit eventually (if they don't, that's a different bug that this can't work around - especially since we already don't do the termination for windows store apps). However, to prevent thread pool threads from blocking process exit, don't attempt to shut down BLAS if the process is about to exit - the kernel will kill all our resources anyway. After this patch, the problematic code section is only called if the dll is manually unloaded, which I think is
a) A much rarer use case and
b) even more important that we don't interrupt pool threads in the middle of work that that host application may want to look at later.

cc @Jehan

If the process is about to exit, there's no point trying to do
a bunch of work to clean up resources. The kernel will release
them much more efficiently when the process exits at the end
of this function.
Without this we run into a problem. After 50ms we forcably kill
the pool thread. However, if the pool thread currently holds the
allocation lock, but doesn't get scheduled within 50ms, said lock
will never be released causing an infinite deadlock and preventing
the DLL from releasing. Further, in the unload scenario (as opposed
to process exit), we may actually want to let the thread pool finish
in case it's working on finishing an operation on another thread
(since the user may be expecting the answer).
@Jehan
Copy link

Jehan commented Dec 29, 2019

@Keno So the WaitForSingleObject() won't happen in the normal use case in the end? If so, that's ok. Otherwise I can't agree with it since it would block the process forever in case of bugs or whatever. Even the 5s of our original patch, we had to lower it because GIMP ended up taking up to 10 minutes to start because of it. (cf. #2339).

This being said, we now have a new bug (a crash of GIMP) from some people, which apparently comes from these patchs we made (since this is the only difference between 2 installers we had of version 2.10.14), though we still haven't figured out why and how.
https://gitlab.gnome.org/GNOME/gimp/issues/4385

If your patch makes this whole section of the code not being called anymore and letting the kernel handling thread termination, then I guess it could make our problem go away.

@Keno
Copy link
Contributor Author

Keno commented Dec 29, 2019

It won't be called on process termination anymore, but it will still be called if FreeLibrary is called. I cannot guarantee that there aren't any other bugs here which will cause hangs in the latter case, but if so, any such bugs should be addressed on its own merits. There is no way to ensure that the thread we'd be killing here isn't in a critical section, so by fiddling with the timeout you're only moving around who sees the crashes.

@martin-frbg
Copy link
Collaborator

Probably best to confirm that the GIMP plugin does not get unloaded (and/or does not unload OpenBLAS) after the initial "are you there"...

@Keno
Copy link
Contributor Author

Keno commented Dec 29, 2019

If this patch does cause hangs in the GIMP plugin, I'd like to know about it, so we can try to fix it properly. The bottom line here is that the fix to #307 did not address the underlying issue and turned a deterministic hang into a non-deterministic crash/hang/corruption depending on system contention. I tried the original reproducer from #307 and cannot reproduce the hang, but as I said, I wouldn't be entirely surprised if there was still incorrect code somewhere else that did cause a hang when openblas is unloaded. Nevertheless a deterministic hang is a million times easier to debug than non-deterministic corruption, so if we do see a hang, let's root cause it properly and fix it in the proper place.

@Keno
Copy link
Contributor Author

Keno commented Dec 29, 2019

Actually, give me a second before trying it, I made a small typo in this patch.

Edit: Never mind, the patch is fine.

Keno added a commit to JuliaLang/julia that referenced this pull request Dec 31, 2019
After much debugging, it turns out that OpenBlas is sometimes hanging
on exit due to bad assumptions in its thread management. The upstream
discussion is at OpenMathLib/OpenBLAS#2350, but this
should fix our frequent win32 CI failures in the meantime.
Keno added a commit to JuliaLang/julia that referenced this pull request Dec 31, 2019
After much debugging, it turns out that OpenBlas is sometimes hanging
on exit due to bad assumptions in its thread management. The upstream
discussion is at OpenMathLib/OpenBLAS#2350, but this
should fix our frequent win32 CI failures in the meantime.
@martin-frbg
Copy link
Collaborator

Hanging after getting a thread blown away looks a lot like #2341 (though I am still not entirely convinced that the changed proposed there is the right thing to do)

@jernejs
Copy link

jernejs commented Dec 31, 2019

I just tried building OpenBLAS with this patch, and it causes GIMP to hang at startup (it hangs immediately, before even opening the console window for verbose messages):
image

@vtjnash
Copy link
Contributor

vtjnash commented Dec 31, 2019

There is no guarantee that the kernel will ever schedule the thread holding the lock

Actually, it's potentially worse than that: kernel documentation for this says that it guarantees it will never again schedule the thread since it's already called TerminateThread on them (https://docs.microsoft.com/en-us/windows/win32/dlls/dynamic-link-library-best-practices#best-practices-for-synchronization)

I don't think #2341 is related, since that appears to be Linux, and this is a Win32-only patch.

@Keno
Copy link
Contributor Author

Keno commented Dec 31, 2019

I just tried building OpenBLAS with this patch, and it causes GIMP to hang at startup

Ok, in that case it would be good to know how GIMP uses OpenBLAS, so we can try to fix the underlying issue here. A minimal reproducer would be ideal of course. In the meantime, perhaps, we can just apply the first commit win32: Don't run cleanup if we're about to exit anyway, which should not cause any issues (though I'd appreciate confirmation thereof).

@martin-frbg
Copy link
Collaborator

martin-frbg commented Dec 31, 2019

Right, #2341 is Linux-only... from #1720 (comment) it is probably the Levin-matting tool in the GIMP's GEGL library that causes a load/unload cycle of OpenBLAS on startup.
And the problem is old as dirt - #132 appears to mark the start of the troubles with enforcing thread shutdown to make GEGL happy. (Which makes me suspect that skipping the gotoblas_quit
unconditionally would not help. )

@vtjnash
Copy link
Contributor

vtjnash commented Jan 1, 2020

#1720 looks to me like probably basically the same issue as here, except Windows has a loader lock, which makes many more things difficult. The documentation for DllMain has some nice diagrams to explain why the code here is expected to cause deadlocks. In particular, it says that if another thread tries to exit while there's another thread waiting on it from inside the DllMain function, that will always deadlock (until the timeout returns). If I'm reading the docs correctly, the solution is to instead use a separate Event to signal cleanup completion (and use WaitForMultipleObjects on either one getting signaled).

@martin-frbg
Copy link
Collaborator

Part of my problem (apart from not being a Windows programmer) is that I do not quite understand where the deadlock would occur - when it should only ever be the master thread waiting on others to exit. From my very limited understanding, perhaps using WaitForMultipleObjects with the pool.killed event as the second object (to avoid an infinite wait time) and then instead of the evil TerminateThread just doing a second round of WaitForMultipleObjects for anything not reaped on the first try might work ?

@Jehan
Copy link

Jehan commented Jan 1, 2020

So it turns out our last issue was on our side (some shaddy issue of wrong CFLAGS or something; we are not even sure ourselves). The code as it is currently works for us.

So I would say, in any case, please don't merge this patch (as it just locks GIMP forever). I understand what you are saying @Keno, and in theory you are right that you should not assume that a long wait means there is a problem (maybe it's an expected long wait) so an infinite timeout could theoretically be right.
Yet in practice (as can be seen by our practical case), this is a very bad idea IMO. Yes it probably means there is another bug and you should definitely hunt this root bug. But anyway there may always be other bugs (we all make bugs!), so I think that if possible a library should never have any kind of code which may lock the process forever by design (unless this is exposed to the library user; i.e. if there were an API to select the timeout, then on our side, we would set a low timeout and other people would be free to set an INFINITE one).

from #1720 (comment) it is probably the Levin-matting tool in the GIMP's GEGL library that causes a load/unload cycle of OpenBLAS on startup.

I confirm. GIMP doesn't use OpenBLAS directly, only through GEGL, which uses OpenBLAS only for its Levin-matting operation. Since we load GEGL at startup, it is indeed probably where a load/unload of OpenBlas must happen.

@martin-frbg
Copy link
Collaborator

@Jehan can you check if GIMP works with just the dllinit.c part of Keno's patch (which supposedly does not change behaviour on FreeLibrary) ? At the very least I would like to avoid turning this into a popularity contest between Julia and GIMP...

@Jehan
Copy link

Jehan commented Jan 1, 2020

@martin-frbg

At the very least I would like to avoid turning this into a popularity contest between Julia and GIMP...

Obviously I agree. I am all for all Free Software to improve together. 🙂

can you check if GIMP works with just the dllinit.c part of Keno's patch (which supposedly does not change behaviour on FreeLibrary) ?

I guess it will because this was not part of our problem (but that's just a very wild guess). Anyway this should actually be asked to @jernejs. He is the one who maintains our Windows installer/build.

@jernejs Would it be possible to make this test when you have some time?

@vtjnash
Copy link
Contributor

vtjnash commented Jan 2, 2020

Part of my problem (apart from not being a Windows programmer)

That seems like the opposite of a problem :P I’m cursed with the opposite affliction: having only passing knowledge of the OpenBLAS scheduler.

But to summarize, that WaitForObject call is actually (guaranteed) to behave the same as calling Sleep(nthreads * TIMEOUT) because it’s happening inside the DllMain lock (each thread must acquire that lock before it can exit). So 50 ms isn’t great for anyone, but 5s or INFINITE are clearly worse. The workaround Microsoft suggests is to wait on some other object. AFAICT, there’s no existing signal to indicate that the thread has finished executing blas_memory_free, and is now waiting for the DllMain lock (currently running on thread one) to return so that it can also exit and set its state to signaled.

staticfloat pushed a commit to JuliaLang/julia that referenced this pull request Jan 3, 2020
After much debugging, it turns out that OpenBlas is sometimes hanging
on exit due to bad assumptions in its thread management. The upstream
discussion is at OpenMathLib/OpenBLAS#2350, but this
should fix our frequent win32 CI failures in the meantime.
@martin-frbg
Copy link
Collaborator

After reading https://docs.microsoft.com/de-de/windows/win32/dlls/dynamic-link-library-best-practices it seems clear to me that the first part of this PR (to not attempt to shut down threads on process exit) is correct. I am even more confused about what to do in the case of an immediate DLL unload while threads are still barely starting up - the available documentation appears to do little more than pronounce this a fundamentally unsafe situation and enumerate
possible reasons for a subsequent crash or deadlock. Perhaps starting the blas_thread_shutdown() function with a few seconds sleep() might help to reduce the risk ??

@martin-frbg
Copy link
Collaborator

Furthermore, from https://devblogs.microsoft.com/oldnewthing/?p=25283 (last paragraph) it appears that to "properly wait until the thread finishes in dll unload" is not possible when that wait is initiated from DllMain (thread cannot exit before it has sent DLL_THREAD_DETACH, which it cannot do while the DLL_PROCESS_DETACH handler is still running). So blas_thread_shutdown() in any of its recent forms would probably be safe to invoke from the "other" code before it tries to unload the openblas dll (and also as part of our internal fork handler if necessary), but should not be called during any form of complete shutdown, be it dll unload or program exit. Not sure if we could get away with just doing nothing on DLL_PROCESS_DETACH in the unload case as well,
or if this would cause a leak in the GIMP scenario ?

@vtjnash
Copy link
Contributor

vtjnash commented Jan 5, 2020

Calling sleep won’t help, because the other threads are waiting on the lock currently held by the main thread.

Ignoring it on program exit should be fine, since the other threads should be already killed anyways.

Ignoring it on dll unload might be OK, but you’d have to leak any resources that could ever be examined by another thread (because you don’t know if it’s simultaneously accessing it). It’s unclear to me if Windows will leak the DLL memory, if it it’ll unmap that (causing the other threads to fault if they try to run after shutdown)

I thought of one possible solution, but I’m surprised no other post has mentioned this (though it is tricky). Since the issue is that the LoadLibrary ref count might reach zero before all worker threads have started, the startup routine could acquire an extra handle to itself (increment the ref count). The last worker to start could then release (FreeLibrary) that extra ref count. That would defer the free until all threads are out of the loader lock. A few complications though:

  • LoadLibrary needs to acquire the loader lock, so that might just hang and might not be usable
  • The FreeLibrary call (final DllMain call) might now happen on one of the worker threads (so returning from that call would fault), so it actually needs to call FreeLibraryAndExit thread from yet another thread spawned from that point
  • Alternatively, there’s a flag to disable FreeLibrary entirely (GET_MODULE_HANDLE_EX_FLAG_PIN), but again, unclear when you can call this without hanging for it to (re)acquire the loader lock
  • But if those function hang, then it could instead poke the memory directly to change the ref count??

@martin-frbg
Copy link
Collaborator

Hmm. Looks like a rather involved way to invoke undefined behaviour ? (Though I saw something similar discussed on stackoverflow https://stackoverflow.com/questions/39241400/cleaning-up-threads-in-a-dll-endthreadex-vs-terminatethread)

@martin-frbg
Copy link
Collaborator

A quick test (replacing the openblas.dll in the current official GIMP package on a Windows system) does not show any obvious leaks or other shortcomings from not attempting any cleanup in the DLL unload case - certainly nothing catastrophic.

@vtjnash
Copy link
Contributor

vtjnash commented Jan 7, 2020

I don't think any of the behavior I mentioned was undefined, I just noted that some of the functions may require the loader lock, which could be guaranteed to hang. From Google, it appears that others have used GetModuleHandleExW(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS, hInst, &hMod) here. The free-in-a-worker-thread trick is then also fairly straightforward: in the last worker thread to start up, schedule a call on the threadpool to run Free­Library­When­Callback­Returns. I think this is somewhat mentioned at https://devblogs.microsoft.com/oldnewthing/20151225-00/?p=92711 and https://devblogs.microsoft.com/oldnewthing/20131105-00/?p=2733 and https://devblogs.microsoft.com/oldnewthing/20131107-00/?p=2713, but I don't see him spell it out?

PS. Here's the text of Daniel's comment from that last link detailing the race condition present in that solution (while the direct blog link is now dead, it was recoverable on the wayback machine):

The code for ProxyFreeLibraryAndExitThread actually contains a serious race condition
If the two last threads call it at the same time, its possible that:
  1. thread A first calls InterlockedExchange (receives 1), so goes to the ExitThread branch.
  2. before it goes any further thread B gets active goes to InterlockedExchange, receives 0 and call FreeLibraryAndExitThread
  3. Thread A now would like to continue, but it's code has been dropped in the meantime...

However, this should not be a problem given the plan I outlined above.

@martin-frbg
Copy link
Collaborator

To clarify I am not claiming that what you outlined would invoke "defined" undefined behaviour, just that there does not seem to be any official documentation stating that this is the recommended, guaranteed way to address this problem. So we might again end up with something that happens to work right now but causes insiduous problems with some later version/patchlevel of windows ?
(Of course your solution would be much more elegant than the trivial approach of just doing nothing, hoping that windows does the right thing)

@vtjnash
Copy link
Contributor

vtjnash commented Jan 8, 2020

Not doing anything is relying on a race condition (that the worker threads are currently sleeping and will never awake after the DLL is unmapped from memory—that's nearly the same one that's sometimes seen to fail now). The approach I outlined should only be relying on simple defined behaviors (reference counting) and is presumably also the underlying implementation for the Win32 ThreadPool https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-setthreadpoolcallbacklibrary

KristofferC pushed a commit to JuliaLang/julia that referenced this pull request Jan 15, 2020
After much debugging, it turns out that OpenBlas is sometimes hanging
on exit due to bad assumptions in its thread management. The upstream
discussion is at OpenMathLib/OpenBLAS#2350, but this
should fix our frequent win32 CI failures in the meantime.
KristofferC pushed a commit to JuliaLang/julia that referenced this pull request Jan 17, 2020
After much debugging, it turns out that OpenBlas is sometimes hanging
on exit due to bad assumptions in its thread management. The upstream
discussion is at OpenMathLib/OpenBLAS#2350, but this
should fix our frequent win32 CI failures in the meantime.
staticfloat pushed a commit to JuliaLang/julia that referenced this pull request Jan 17, 2020
After much debugging, it turns out that OpenBlas is sometimes hanging
on exit due to bad assumptions in its thread management. The upstream
discussion is at OpenMathLib/OpenBLAS#2350, but this
should fix our frequent win32 CI failures in the meantime.
staticfloat pushed a commit to JuliaLang/julia that referenced this pull request Jan 18, 2020
After much debugging, it turns out that OpenBlas is sometimes hanging
on exit due to bad assumptions in its thread management. The upstream
discussion is at OpenMathLib/OpenBLAS#2350, but this
should fix our frequent win32 CI failures in the meantime.
KristofferC pushed a commit to JuliaLang/julia that referenced this pull request Apr 11, 2020
After much debugging, it turns out that OpenBlas is sometimes hanging
on exit due to bad assumptions in its thread management. The upstream
discussion is at OpenMathLib/OpenBLAS#2350, but this
should fix our frequent win32 CI failures in the meantime.
KristofferC pushed a commit to JuliaLang/julia that referenced this pull request Apr 11, 2020
After much debugging, it turns out that OpenBlas is sometimes hanging
on exit due to bad assumptions in its thread management. The upstream
discussion is at OpenMathLib/OpenBLAS#2350, but this
should fix our frequent win32 CI failures in the meantime.
BioTurboNick pushed a commit to BioTurboNick/julia that referenced this pull request Apr 13, 2020
After much debugging, it turns out that OpenBlas is sometimes hanging
on exit due to bad assumptions in its thread management. The upstream
discussion is at OpenMathLib/OpenBLAS#2350, but this
should fix our frequent win32 CI failures in the meantime.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants