sys.so (sysimg) file corruption #8200

Closed
vtjnash opened this issue Sep 1, 2014 · 24 comments
Labels
bug Indicates an unexpected problem or unintended behavior

Comments

@vtjnash
Member

vtjnash commented Sep 1, 2014

From the mailing list (I'm also seeing this on 32-bit Linux):

When I updated Julia today on my Mac (10.9.2), I got the following error:

/bin/sh: line 1: 23089 Segmentation fault: 11 /Users/danluu/dev/julia/usr/bin/julia --build /Users/danluu/dev/julia/usr/lib/julia/sys -J/Users/danluu/dev/julia/usr/lib/julia/$([ -e /Users/danluu/dev/julia/usr/lib/julia/sys.ji ] && echo sys.ji || echo sys0.ji) -f sysimg.jl
*** This error is usually fixed by running 'make clean'. If the error persists, try 'make cleanall'. ***
make[1]: *** [/Users/danluu/dev/julia/usr/lib/julia/sys.o] Error 1
make: *** [release] Error 2

I've tried doing make cleanall, and even wiping out my repository and
re-cloning in case it's a problem with deps, and I still get the same
error.

On Linux (64-bit, 3.2.0-65-generic), the build doesn't error out, but
Julia segfaults on startup. The gdb backtrace for that is:
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff6e2328c in jl_deserialize_gv (v=0x7bb138, s=0x7fffffffdcc0) at dump.c:145
145             *sysimg_gvars[gvname_index] = v;
(gdb) bt
#0  0x00007ffff6e2328c in jl_deserialize_gv (v=0x7bb138, s=0x7fffffffdcc0) at dump.c:145
#1  jl_deserialize_value_internal (s=0x7fffffffdcc0) at dump.c:854
#2  0x00007ffff6e233e5 in jl_deserialize_value (s=0x7fffffffdcc0) at dump.c:950
#3  jl_deserialize_value_internal (s=0x7fffffffdcc0) at dump.c:937
#4  0x00007ffff6e2350d in jl_deserialize_value (s=0x7fffffffdcc0) at dump.c:950
#5  jl_deserialize_datatype (pos=403560, s=0x7fffffffdcc0) at dump.c:646
#6  jl_deserialize_value_internal (s=0x7fffffffdcc0) at dump.c:886
#7  0x00007ffff6e22818 in jl_deserialize_value (s=0x7fffffffdcc0) at dump.c:950
#8  jl_deserialize_value_internal (s=0x7fffffffdcc0) at dump.c:715
...
#134 jl_deserialize_value_internal (s=0x7fffffffdcc0) at dump.c:715
#135 0x00007ffff6e233e5 in jl_deserialize_value (s=0x7fffffffdcc0) at dump.c:950
#136 jl_deserialize_value_internal (s=0x7fffffffdcc0) at dump.c:937
#137 0x00007ffff6e233e5 in jl_deserialize_value (s=0x7fffffffdcc0) at dump.c:950
#138 jl_deserialize_value_internal (s=0x7fffffffdcc0) at dump.c:937
#139 0x00007ffff6e23881 in jl_deserialize_value (s=0x7fffffffdcc0) at dump.c:950
#140 jl_restore_system_image (fname=<optimized out>) at dump.c:1060
#141 0x00007ffff6e1f33b in julia_init (imageFile=0x608e60 "/home/dluu/dev/julia/usr/bin/../lib/julia/sys.ji") at init.c:826
#142 0x000000000040140a in main (argc=0, argv=0x7fffffffe1c0) at repl.c:378
@vtjnash vtjnash added the bug label Sep 1, 2014
@vtjnash
Member Author

vtjnash commented Sep 1, 2014

i suspect an issue with the sizeof builtin function pointer

@tkelman
Contributor

tkelman commented Sep 1, 2014

As best I can tell, this started happening with the "size 0" commits a few days ago

@catawbasam
Contributor

I just saw something similar on a 64-bit linux system while attempting to build a refreshed version of master.
The first time running make after make cleanall I got a segfault at precompile.jl.
After running make a 2nd time, the build appeared to finish, but I got a segfault upon attempting to run julia.

@rickhg12hs
Contributor

I saw this or a similar error also on my 32-bit Fedora 19 system ( #8180 ). I don't know the mechanism of failure, but after I cleared my ccache with ccache -C and then rebuilt, everything was fine.

@catawbasam
Contributor

AFAIK, I don't have ccache set up (running ccache at the prompt returns "ccache: Command not found."), so I don't think that is a factor in my case.
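
For anyone unsure whether ccache is involved in their own build, here is a minimal sketch of that check, assuming a POSIX shell. The function names `have_cmd` and `clear_ccache_if_present` are made up for illustration; `ccache -C` is the same invocation mentioned earlier in the thread.

```shell
# Hypothetical helper: succeed if the named command can be found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Clear the compiler cache only when a ccache binary is actually installed.
clear_ccache_if_present() {
    if have_cmd ccache; then
        ccache -C
    else
        echo "ccache not found; nothing to clear"
    fi
}
```

Calling `clear_ccache_if_present` before a rebuild rules the cache in or out. It does not tell you whether the cache was the cause of the segfault.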

@catawbasam
Contributor

I had these three lines in my Make.user. Removing the first two appears to have done the trick:

OPENBLAS_TARGET=NEHALEM
OPENBLAS_DYNAMIC_ARCH = 0
OPENBLAS_USE_THREAD=0

@kmsquire
Member

kmsquire commented Sep 3, 2014

I'm also (still) running into this on the latest master. Rebuilding everything (including deps) worked after a few tries, but then I got a test failure:

exception on 4: ERROR: test failed: ((1,2) == (2,2))
 in expression: i7197() == (2,2)
 in error at error.jl:21
 in default_handler at test.jl:25
 in do_test at test.jl:50
 in runtests at /disk2/kevin-src/julia/test/testdefs.jl:5
 in anonymous at multi.jl:855
 in run_work_thunk at multi.jl:621
 in anonymous at task.jl:855
while loading arrayops.jl, in expression starting on line 895
ERROR: test failed: ((1,2) == (2,2))
 in expression: i7197() == (2,2)
 in anonymous at task.jl:1367
while loading arrayops.jl, in expression starting on line 895
while loading /disk2/kevin-src/julia/test/runtests.jl, in expression starting on line 36

Indeed:

julia> function i7197()
           S = [1 2 3; 4 5 6; 7 8 9]
           ind2sub(size(S), 5)
       end
i7197 (generic function with 1 method)

julia> i7197()
(1,2)

julia> S = [1 2 3; 4 5 6; 7 8 9]
3x3 Array{Int64,2}:
 1  2  3
 4  5  6
 7  8  9

julia> ind2sub(size(S), 5)
(2,2)

@kmsquire
Member

kmsquire commented Sep 3, 2014

Okay, that last failure is unrelated. I probably need to update my llvm deps to pull in the correct patch.

@sjkelly
Contributor

sjkelly commented Sep 3, 2014

I normally pull, make clean, then make. That led to a segfault, so I did a make cleanall. It didn't work the first time. Since @Keno suggested it, I tried again for my own sanity and it worked the second time. I backtracked and AFAIK I didn't do anything different.

Copy pasta of my shell:
https://gist.github.com/sjkelly/ac2364ed37214dcaad91

make testall works too.

EDIT: Grepped commands... https://gist.githubusercontent.com/sjkelly/ac2364ed37214dcaad91/raw/term_grepped.txt

@catawbasam
Contributor

It turns out my tweaks to Make.user were not a fix.
I started a branch and got a segfault on make clean testall, so I went back to master to check that.
On master, make clean followed by make segfaulted. make clean all appeared to work, and then make clean testall segfaulted again.

@danluu
Contributor

danluu commented Sep 3, 2014

This seems like a Makefile dependency issue? This thread, where people do the same thing and get different results, is pretty suspicious, so I tried 3 runs of make -j 1 vs. 3 runs of make -j 5.

All three make -j 1 runs worked (by which I mean they didn't segfault; I didn't do any testing other than firing up the REPL). All three make -j 5 runs failed; two segfaulted when I tried to open the REPL, and one segfaulted during the build.
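
A rough sketch of how a comparison like this could be scripted, so that the serial and parallel trials are run the same way each time. The function names `count_failures` and `build_and_start` are made up for illustration, and the sketch assumes it is run from the top of a Julia source checkout where `./julia` exists after a successful build.

```shell
# Hypothetical helper: run a command N times and print how many runs failed.
# Usage: count_failures N command [args...]
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # A segfault shows up as a non-zero exit status, so it counts as a failure.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}

# One trial: clean, rebuild with the given job count, then start Julia once.
build_and_start() {
    njobs=$1
    make clean && make -j "$njobs" && ./julia -e 'println("ok")'
}

# Example (commented out because every trial is a full rebuild):
# echo "make -j 1 failures: $(count_failures 3 build_and_start 1)"
# echo "make -j 5 failures: $(count_failures 3 build_and_start 5)"
```

Three runs per setting is a small sample for an intermittent failure, so a larger run count gives a more trustworthy comparison.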

@ihnorton
Member

ihnorton commented Sep 3, 2014

-jN only applies to the dependencies and, where possible, to the codegen/runtime
build (things in src/). There is no parallelism in the sysimage build.

On Tue, Sep 2, 2014 at 9:13 PM, Dan Luu wrote:

Is julia supposed to work when doing a parallel build? This thread, where
people do the same thing and get different results, is pretty suspicious,
so I tried 3 runs of make -j 1 vs. 3 runs of make -j 5.

All three make -j 1 runs worked (by which I mean they didn't segfault; I
didn't do any testing other than firing up the REPL). All three make -j 5
runs failed; two segfaulted when I tried to open the REPL, and one
segfaulted during the build.



@kmsquire
Member

kmsquire commented Sep 3, 2014

In the past, parallel builds haven't been too much of a problem, meaning
that if they succeeded, Julia ran fine.

There have been some Makefile changes recently, so I wonder if a dependency
or two got lost along the way.


@rickhg12hs
Contributor

Again, I'm not sure if it's related, but after a recent and small git pull && make && make test-core, it segfaulted. Running strace ./julia seemed to show a problem with usr/lib/julia/sys.ji, so I ran rm usr/lib/julia/sys.* && make, and everything seems to work now.
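
The same workaround as a small sketch, assuming a POSIX shell. `remove_sysimg` is a made-up name and the checkout path is passed as an argument. The glob `sys.*` matches files such as sys.ji and sys.o but leaves sys0.ji alone, which is the image the build command quoted at the top of the thread falls back to when sys.ji is missing.

```shell
# Hypothetical helper: delete the generated system image files under a
# Julia source checkout (default: current directory) so that the next
# `make` has to regenerate them.
remove_sysimg() {
    root=${1:-.}
    # -f keeps rm quiet when no sys.* files exist.
    rm -f "$root"/usr/lib/julia/sys.*
}

# Typical use from the top of the checkout (commented out: triggers a rebuild):
# remove_sysimg . && make
```

This only forces the system image to be regenerated. If the regenerated image is corrupted in the same way, the segfault comes back.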

@kmsquire
Member

kmsquire commented Sep 3, 2014

For example, in my build, I seem to be missing a patch to LLVM. I haven't
explored fully, but I'm not sure why it isn't being applied.


@sjkelly
Contributor

sjkelly commented Sep 3, 2014

@kmsquire is it #7197 ?

@kmsquire
Member

kmsquire commented Sep 3, 2014

Yes. Is there something special I need to do? I built llvm from scratch already.

@sjkelly
Contributor

sjkelly commented Sep 3, 2014

@kmsquire make -C deps clean-llvm wasn't working for me. I tend to remember doing an rm -rf * followed by git reset --hard then leaving to bug my office mates :P, but that is not the correct solution.

@tkelman
Contributor

tkelman commented Sep 3, 2014

@kmsquire make -C deps distclean-llvm for that one.

This more recent segfault failure is intermittent and strange, and apparently everyone's seeing it. Try running julia-debug; I was getting a few assertion failures in interesting places.

@kmsquire
Member

kmsquire commented Sep 3, 2014

Thanks @tkelman and @sjkelly, I'll try the distclean-llvm and then see what else happens.

@ihnorton
Member

ihnorton commented Sep 3, 2014

touch deps/llvm-3.3/configure and then make -C deps configure-llvm should force re-application and possibly avoid rebuilding everything.

@nolta
Member

nolta commented Sep 3, 2014

I'm seeing this segfault ~50% of the time. Can we temporarily revert #8156 and friends?

@JeffBezanson
Member

@vtjnash I believe this is related to DTINSTANCE_PLACEHOLDER, whose assumptions are now violated since a type can have dt->instance set even if it has fields.

@vtjnash
Member Author

vtjnash commented Sep 4, 2014

ah. that also happens to be exactly what the existing comment in dump.c says just before starting to deserialize a Type.
