Program is hanging with active threadpool even after drop(pool)?
#776
Comments
It's known that dropping the pool doesn't actually wait for all threads to exit -- see #688. And we never stop the global pool (#483), because I was never able to find a safe way to do so. I haven't thought about it in a while though... Still, I wouldn't expect that to block program exit. Can you attach a debugger and see what it's doing?
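As a rough std-only analogy for the point above (this is not rayon's implementation), dropping a `std::thread::JoinHandle` detaches the thread instead of waiting for it, and an idle detached thread does not keep the process alive once `main` returns. The helper `spawn_detached_worker` and the doubling "job" are hypothetical names used only for illustration.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical helper: starts one worker thread, then drops its JoinHandle.
/// Dropping the handle detaches the thread; it does not wait for it to exit.
fn spawn_detached_worker() -> (mpsc::Sender<u64>, mpsc::Receiver<u64>) {
    let (job_tx, job_rx) = mpsc::channel::<u64>();
    let (result_tx, result_rx) = mpsc::channel::<u64>();

    let handle = thread::spawn(move || {
        // The worker sits idle in `recv` until a job arrives or the sender is dropped.
        for n in job_rx {
            if result_tx.send(n * 2).is_err() {
                break;
            }
        }
    });

    // Analogous to dropping a pool handle: the thread keeps running after this.
    drop(handle);

    (job_tx, result_rx)
}

fn main() {
    let (jobs, results) = spawn_detached_worker();

    // The worker is still alive even though its handle was dropped.
    jobs.send(21).unwrap();
    println!("worker replied: {}", results.recv().unwrap());

    // `main` returns here. The process exits without joining the worker,
    // so an idle worker thread alone should not block program exit.
}
```

If a program hangs at exit with only idle workers visible, this suggests looking at what the main thread is doing rather than at the workers.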
I will look into it (the flamegraph was not useful). It's odd because it sometimes seems to work fine, but certain input fails. Which maybe means there's an unhandled error somewhere that prevents the exit, but I'm not sure where that might be.
I just ran the program with Going to rerun to confirm this is reproducible. I'll keep looking; I'm not sure if this is a rayon issue or something in my code that was being hidden.
That sounds like your current thread was captured at the point which waits for a lock (implemented by futexes), and you just don't have the debug sources for glibc available. No big deal. Try
Yeah I am re-running (unfortunately debug mode is slower and the run I know reproduces this is large). Also trying to get more error info because I think I'm getting error(s) inside the thread that weren't showing up in logs when in release mode.
Hm, I'm not sure this is a rayon issue but I'm still seeing it. The output of example of a thread (I believe they all look the same)
Main thread
If all the rayon threads look like the example you picked, then they're just idle. For the dropping backtrace in your main thread, if you step through does it make progress? Maybe this is just a lot of nested containers that take a while to drop in debug mode?
It does seem like it's spending its time in This program does allocate a very large shared ndarray that all of the threads use. Is it possible that it's having trouble freeing it because it was used by so many threads, even though they should be done?
If that's one large allocation, I wouldn't think threads would matter much, but it's possible. If it's an ndarray where all the items have allocations, created on many different threads, then this seems even more plausible that it is triggering something bad in the allocator. Either way, it might be worth trying on a different target (different system allocator) or with something like
It's one big allocation, and then the threads operate on separate chunks of it in parallel. I'm going to try running with
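To make the access pattern described above concrete, here is a minimal sketch of "one big allocation, disjoint chunks per thread". It is an assumption-laden stand-in: it uses a plain `Vec<u64>` and `std::thread::scope` rather than the ndarray and rayon pool from the actual program, and `double_in_parallel` is a hypothetical name.

```rust
use std::thread;

/// Hypothetical stand-in for the real workload: doubles every element in place.
/// The buffer is a single allocation; each thread gets a disjoint mutable chunk.
fn double_in_parallel(data: &mut [u64], workers: usize) {
    let workers = workers.max(1);
    // Round up so every element lands in some chunk; never pass 0 to chunks_mut.
    let chunk_len = ((data.len() + workers - 1) / workers).max(1);

    // Scoped threads are all joined before `scope` returns,
    // so no thread can still touch the buffer afterwards.
    thread::scope(|scope| {
        for chunk in data.chunks_mut(chunk_len) {
            scope.spawn(move || {
                for value in chunk.iter_mut() {
                    *value *= 2;
                }
            });
        }
    });
}

fn main() {
    // One allocation made up front on the main thread.
    let mut data: Vec<u64> = (0..1_000).collect();
    double_in_parallel(&mut data, 4);
    println!("last element: {}", data[data.len() - 1]);

    // The single buffer is freed on the main thread, after all workers are done.
    drop(data);
}
```

In this pattern the worker threads never allocate or free anything themselves, which is why a slow or stuck free of the buffer would point at the allocator rather than at the threads.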
Update: both of those attempts worked. jemallocator exited cleanly and running the same program on a ppc64le machine was also fine. So I guess it's something to do with how malloc is freeing up memory (or not doing that). I guess this is probably unrelated to rayon, unless something to do with shared memory access has caused it to lock? It seems like it can't obtain a mutex when it's trying to free.
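For readers unfamiliar with how a Rust program switches allocators, the mechanism is the `#[global_allocator]` attribute. The sketch below is std-only and illustrative: instead of jemallocator it installs a hypothetical `CountingAlloc` wrapper around `std::alloc::System` that tracks live bytes. An allocator from a crate would be named in the same `static` in place of `CountingAlloc`.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper: forwards to the system allocator and counts live bytes.
struct CountingAlloc;

static LIVE_BYTES: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            LIVE_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        LIVE_BYTES.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

// Every heap allocation in the program now goes through CountingAlloc.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Bytes currently allocated through the global allocator.
fn live_bytes() -> usize {
    LIVE_BYTES.load(Ordering::Relaxed)
}

fn main() {
    let before = live_bytes();

    // One large allocation, similar in spirit to the big shared array.
    let buffer: Vec<u8> = vec![0u8; 1 << 20];
    std::hint::black_box(&buffer); // keep the allocation from being optimized away
    let during = live_bytes();

    drop(buffer);
    let after = live_bytes();

    println!("growth while buffer was live: {} bytes", during.saturating_sub(before));
    println!("released by drop: {} bytes", during.saturating_sub(after));
}
```

A wrapper like this can also help confirm whether the final `drop` of a large buffer actually reaches the allocator's free path before the program stalls.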
Rayon doesn't do anything to "lock" memory, and mere access from separate threads shouldn't be a problem. I suspect a race condition in the allocation, but even then it's pretty weird given that you allocate it all up front. So I don't know, but this problem doesn't seem to be a rayon issue. Maybe you could try the LLVM address sanitizer, though I think that's only available on nightly x86_64 rustc. I guess if the system allocator is broken, it would also have to be compiled with the sanitizer to find this. Valgrind memcheck is more general, but it's not great with threads, especially the spinning that rayon does in work stealing. I'm going to close here, but good luck!
Yes, closing makes sense, and thanks for your help debugging a non-rayon problem!
I'm not sure if this is a rayon issue or something else but it feels like an issue with the threads not exiting.
I am using rayon as part of a tool for processing sequencing data (code is here). It mostly works, but on occasion the program seems to hang at the end instead of exiting (specifically it's this program that hangs, the others seem okay).
Everything in the code has finished (I see the last logging message, and the output file is present) but the program never exits. When I check `htop` it appears the threadpool is still there, not doing anything, but the main thread is using CPU for something (not sure what, maybe spinning on a lock?). I tried adding an explicit `drop(pool)` to the code; it executes that and all the threads are still around, so I'm not sure what's happening. This program configures the global threadpool because I don't want to use all the threads on the machine, so maybe it cannot be explicitly dropped.
Has anyone encountered behavior like this? If there's any useful info I can provide let me know. I'm probably doing something wrong here but I'm not sure what. Currently I am re-running with flamegraph to try to get an idea of what it's doing in this mysterious post-run time.