julia-rc3 & rc4 binary segfault with enough processes #18477
Comments
Can you try to get a backtrace in gdb? Also try under julia-debug, in case that gives different behavior or more information in the backtrace. |
Turns out the segfault only happens when another julia process runs with several processes. Running julia in one shell with
Also, running
then hangs (for at least 10s of seconds), hitting ctrl-C produced:
Going through the same procedure (waiting, then hitting ctrl-c) with gdb produces:
|
This actually happens on the Intel machine after all too, but with a different number of procs needed to trigger it. Another observation: this now seems to happen only on the Intel machine: I run julia-0.4 (which runs fine as far as I can tell) with The IT support for those machines is very helpful, so let me know if I should ask them anything. |
This basically means you are running too many processes with the address space quota. |
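To see whether an address space limit is in effect for a given shell, one option is to query the per-process `RLIMIT_AS` limit, which is what `ulimit -v` reports. The sketch below is only a diagnostic aid and rests on an assumption: that the quota mentioned above is visible as `RLIMIT_AS`. A quota shared across processes (for example one imposed by a cgroup or by the kernel's overcommit policy) would not show up here. The helper names `format_limit` and `address_space_limits` are made up for this example.

```python
import resource


def format_limit(value):
    """Render an rlimit value (in bytes) as a human-readable string."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value / 2**30:.2f} GiB"


def address_space_limits():
    """Return the (soft, hard) RLIMIT_AS limits of the current process, in bytes."""
    return resource.getrlimit(resource.RLIMIT_AS)


if __name__ == "__main__":
    soft, hard = address_space_limits()
    print("soft address space limit:", format_limit(soft))
    print("hard address space limit:", format_limit(hard))
```

If both values print as unlimited, the limit being hit is probably enforced somewhere else, and the IT support for the machines may be the quickest way to find out where.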
Any way to make the failure less alarming, and more descriptive about what the problem was? |
No. But you can try
diff --git a/src/gc-pages.c b/src/gc-pages.c
index fbe2b27..90be548 100644
--- a/src/gc-pages.c
+++ b/src/gc-pages.c
@@ -45,6 +45,7 @@ void jl_gc_init_page(void)
 // Return `NULL` if allocation failed. Result is aligned to `GC_PAGE_SZ`.
 static char *jl_gc_try_alloc_region(int pg_cnt)
 {
+    jl_safe_printf("Try allocating %d pages\n", pg_cnt);
     const size_t pages_sz = sizeof(jl_gc_page_t) * pg_cnt;
     const size_t freemap_sz = sizeof(uint32_t) * pg_cnt / 32;
     const size_t meta_sz = sizeof(jl_gc_pagemeta_t) * pg_cnt;
to see how much it's trying to allocate. |
But then why is this a problem with 0.5-rc4 but not with 0.4.6? Or at least the problem on 0.4.6 occurs much later. |
Because you are not applying the patch to pessimistically use a tiny page size on 0.5. |
Ah, my understanding was that this was fixed with #16385. |
It still needs to pick a size, and if you have a total quota shared between multiple processes, it would have to be very pessimistic in order to not hit the limit. |
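The arithmetic behind this can be sketched as follows. All numbers in the usage note are hypothetical and chosen only to illustrate the point; the thread does not state the actual quota or the region size Julia requests. The function `shrink_until_fits` shows one generic way a runtime could retry with smaller requests; it is an illustration under that assumption, not a description of what Julia's GC does. All function names are invented for this example.

```python
GIB = 2**30


def processes_that_fit(total_quota_bytes, region_bytes):
    """Number of processes that can each reserve `region_bytes` within a shared quota."""
    if region_bytes < 1:
        raise ValueError("region_bytes must be at least 1")
    return total_quota_bytes // region_bytes


def max_region_bytes(total_quota_bytes, n_processes):
    """Largest equal reservation per process that keeps the total within the quota."""
    if n_processes < 1:
        raise ValueError("n_processes must be at least 1")
    return total_quota_bytes // n_processes


def shrink_until_fits(initial_pages, min_pages, try_reserve):
    """Halve the request until `try_reserve(pages)` succeeds.

    Returns the page count that succeeded, or None if even `min_pages` failed.
    """
    if min_pages < 1:
        raise ValueError("min_pages must be at least 1")
    pages = initial_pages
    while pages >= min_pages:
        if try_reserve(pages):
            return pages
        pages //= 2
    return None
```

With a hypothetical shared quota of 64 GiB and a hypothetical 8 GiB reservation per process, `processes_that_fit(64 * GIB, 8 * GIB)` gives 8, so a ninth process would fail to reserve its region. To support 16 processes under the same quota, `max_region_bytes(64 * GIB, 16)` shows each one would have to settle for 4 GiB. That is the sense in which a fixed size has to be pessimistic when the number of processes is not known in advance.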
For 0.4 I used |
Cool, changing However, is this a new issue or is #10390 not resolved? Also, the error is now different to before. Now running the |
Dup of #17987 |
Can you please explain why not? At least with the older error it was obvious what was happening. |
All the related allocations are checked afaict and if it still segfaults, there's nothing we can do to figure out why. |
Is the hang when there are enough processors an error handling issue in the parallel code then? |
On a Linux machine with an AMD processor, the binaries of 0.5-rc3 and rc4 both segfault on startup:
On a different machine with an Intel CPU (sharing the file system), the same binary starts and the tests work in serial (but not in parallel, though that is probably a different issue). (Edit: actually not true, see below.)
I'm also currently building from source, but there too only the serial tests run. (Note: for the 0.4 build I had to #10390 (comment); other than that, 0.4 worked fine.)