
aarch64 musl binaries panic since 2018-02-05 nightly #48967

Closed
tjkirch opened this issue Mar 12, 2018 · 41 comments · Fixed by #52087
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-thread-locals Area: Thread local storage (TLS) C-bug Category: This is a bug. O-Arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state O-musl Target: The musl libc P-medium Medium priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@tjkirch

tjkirch commented Mar 12, 2018

aarch64-unknown-linux-musl binaries crash immediately when built using Rust nightly since 2018-02-05, including the current beta, 1.25.0-beta.9.

This happens in debug and release builds.

It works fine with 2018-02-04, or with stable Rust 1.24.1. This is building on an x86_64-unknown-linux-gnu host, which shows no errors or warnings, and running on an embedded Linux 4.9 device.

I tried this code:

A fresh cargo new --bin testme. (The same thing happens with a completely empty fn main() {}.)

I expected to see this happen:

The binary should run without panicking. With nightly 2018-02-04, it looks like this, through strace:

execve("/tmp/testme", ["/tmp/testme"], 0x7fe28422a0 /* 9 vars */) = 0
mmap(NULL, 448, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f87569000
set_tid_address(0x7f87569038)           = 19524
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x425660}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RT_1 RT_2], NULL, 8) = 0
rt_sigaction(SIGSEGV, {sa_handler=0x408844, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO|SA_ONSTACK, sa_restorer=0x425660}, NULL, 8) = 0
rt_sigaction(SIGBUS, {sa_handler=0x408844, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO|SA_ONSTACK, sa_restorer=0x425660}, NULL, 8) = 0
sigaltstack(NULL, {ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}) = 0
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f87566000
Hello, world!
sigaltstack({ss_sp=0x7f87566000, ss_flags=0, ss_size=12288}, NULL) = 0
brk(NULL)                               = 0x451000
brk(0x452000)                           = 0x452000
write(1, "Hello, world!\n", 14)         = 14
sigaltstack({ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=12288}, NULL) = 0
munmap(0x7f87566000, 12288)             = 0
exit_group(0)                           = ?
+++ exited with 0 +++

Instead, with 2018-02-05+, this happened:

thread panicked while processing panic. aborting.
Trace/breakpoint trap (core dumped)

With strace:

execve("/tmp/testme", ["/tmp/testme"], 0x7fe4522920 /* 9 vars */) = 0
mmap(NULL, 520, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f82fb0000
set_tid_address(0x7f82fb0040)           = 6644
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x424780}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RT_1 RT_2], NULL, 8) = 0
rt_sigaction(SIGSEGV, {sa_handler=0x40cad8, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO|SA_ONSTACK, sa_restorer=0x424780}, NULL, 8) = 0
rt_sigaction(SIGBUS, {sa_handler=0x40cad8, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO|SA_ONSTACK, sa_restorer=0x424780}, NULL, 8) = 0
sigaltstack(NULL, {ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}) = 0
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f82fad000
sigaltstack({ss_sp=0x7f82fad000, ss_flags=0, ss_size=12288}, NULL) = 0
brk(NULL)                               = 0x44f000
brk(0x450000)                           = 0x450000
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=NULL} ---
write(2, "thread panicked while processing panic. aborting.\n", 50thread panicked while processing panic. aborting.
) = 50
--- SIGTRAP {si_signo=SIGTRAP, si_code=TRAP_BRKPT, si_pid=4242620, si_uid=0} ---
+++ killed by SIGTRAP (core dumped) +++
Trace/breakpoint trap (core dumped)

Here's the backtrace from the core, using gdb 7.1.2:

(gdb) bt
#0  std::panicking::rust_panic_with_hook::h32ba6c175ffd549b () at libstd/panicking.rs:562
#1  0x000000000040bbb4 in std::panicking::begin_panic::h54415e6a8d568874 () at libstd/panicking.rs:537
#2  0x000000000040bb24 in std::panicking::begin_panic_fmt::he7ac75f1ed16a44d () at libstd/panicking.rs:521
#3  0x000000000040bab8 in rust_begin_unwind () at libstd/panicking.rs:497
#4  0x0000000000425740 in core::panicking::panic_fmt::he44a5deca1206c59 () at libcore/panicking.rs:71
#5  0x0000000000401e34 in core::result::unwrap_failed::h3872e2fc78d2079d () at /checkout/src/libcore/macros.rs:23
#6  0x000000000040316c in _$LT$core..result..Result$LT$T$C$$u20$E$GT$$GT$::expect::hd60da7509cc07397 () at /checkout/src/libcore/result.rs:809
#7  _$LT$core..cell..RefCell$LT$T$GT$$GT$::borrow::hb3e1337e98c3fd8c () at /checkout/src/libcore/cell.rs:692
#8  std::sys_common::thread_info::ThreadInfo::with::_$u7b$$u7b$closure$u7d$$u7d$::h8953affb3be93291 () at libstd/sys_common/thread_info.rs:26
#9  _$LT$std..thread..local..LocalKey$LT$T$GT$$GT$::try_with::h26bdeb702705cd55 () at libstd/thread/local.rs:377
#10 0x000000000040b3d0 in std::sys_common::thread_info::ThreadInfo::with::hea74330e81f8e238 () at libstd/sys_common/thread_info.rs:25
#11 std::sys_common::thread_info::current_thread::h1ca059562bf90a53 () at libstd/sys_common/thread_info.rs:38
#12 std::panicking::default_hook::h1c4df4ccc9dbcc8f () at libstd/panicking.rs:366
#13 0x000000000040bd5c in std::panicking::rust_panic_with_hook::h32ba6c175ffd549b () at libstd/panicking.rs:576
#14 0x000000000040bbb4 in std::panicking::begin_panic::h54415e6a8d568874 () at libstd/panicking.rs:537
#15 0x000000000040bb24 in std::panicking::begin_panic_fmt::he7ac75f1ed16a44d () at libstd/panicking.rs:521
#16 0x000000000040bab8 in rust_begin_unwind () at libstd/panicking.rs:497
#17 0x0000000000425740 in core::panicking::panic_fmt::he44a5deca1206c59 () at libcore/panicking.rs:71
#18 0x0000000000401e34 in core::result::unwrap_failed::h3872e2fc78d2079d () at /checkout/src/libcore/macros.rs:23
#19 0x000000000040cf80 in _$LT$core..result..Result$LT$T$C$$u20$E$GT$$GT$::expect::hd60da7509cc07397 () at /checkout/src/libcore/result.rs:809
#20 _$LT$core..cell..RefCell$LT$T$GT$$GT$::borrow::hb3e1337e98c3fd8c () at /checkout/src/libcore/cell.rs:692
#21 std::sys_common::thread_info::ThreadInfo::with::_$u7b$$u7b$closure$u7d$$u7d$::h9b784f47e02471dc () at libstd/sys_common/thread_info.rs:26
#22 _$LT$std..thread..local..LocalKey$LT$T$GT$$GT$::try_with::h679d37f9412a06de () at libstd/thread/local.rs:377
#23 std::sys_common::thread_info::ThreadInfo::with::hbe63383438d236c7 () at libstd/sys_common/thread_info.rs:25
#24 std::sys_common::thread_info::stack_guard::h9e3b15191b000f90 () at libstd/sys_common/thread_info.rs:42
#25 std::sys::unix::stack_overflow::imp::signal_handler::h7311ce70e0648552 () at libstd/sys/unix/stack_overflow.rs:105
#26 <signal handler called>
#27 core::sync::atomic::atomic_add::h28be0c5b8884ec73 () at /checkout/src/libcore/sync/atomic.rs:1529
#28 core::sync::atomic::AtomicUsize::fetch_add::h6d05bf1ec9fefec2 () at /checkout/src/libcore/sync/atomic.rs:1285
#29 _$LT$alloc..arc..Arc$LT$T$GT$$u20$as$u20$core..clone..Clone$GT$::clone::h6010f845e6b4f7cd () at /checkout/src/liballoc/arc.rs:713
#30 _$LT$std..thread..Thread$u20$as$u20$core..clone..Clone$GT$::clone::h676320869dc42c9e () at libstd/thread/mod.rs:995
#31 std::sys_common::thread_info::current_thread::_$u7b$$u7b$closure$u7d$$u7d$::h9421a40bb49e2782 () at libstd/sys_common/thread_info.rs:38
#32 std::sys_common::thread_info::ThreadInfo::with::_$u7b$$u7b$closure$u7d$$u7d$::h8953affb3be93291 () at libstd/sys_common/thread_info.rs:32
#33 _$LT$std..thread..local..LocalKey$LT$T$GT$$GT$::try_with::h26bdeb702705cd55 () at libstd/thread/local.rs:377
#34 0x000000000040b3d0 in std::sys_common::thread_info::ThreadInfo::with::hea74330e81f8e238 () at libstd/sys_common/thread_info.rs:25
#35 std::sys_common::thread_info::current_thread::h1ca059562bf90a53 () at libstd/sys_common/thread_info.rs:38
#36 std::panicking::default_hook::h1c4df4ccc9dbcc8f () at libstd/panicking.rs:366
#37 0x000000000040bd5c in std::panicking::rust_panic_with_hook::h32ba6c175ffd549b () at libstd/panicking.rs:576
#38 0x000000000040bc1c in std::panicking::begin_panic::hd820cd84b7494cee () at libstd/panicking.rs:537
#39 0x000000000040a854 in std::sys_common::thread_info::set::_$u7b$$u7b$closure$u7d$$u7d$::he4d3a0675b98919e () at libstd/sys_common/thread_info.rs:46
#40 _$LT$std..thread..local..LocalKey$LT$T$GT$$GT$::try_with::ha14bf0d7b1a87bd1 () at libstd/thread/local.rs:377
#41 _$LT$std..thread..local..LocalKey$LT$T$GT$$GT$::with::he00cc2566a9cc88e () at libstd/thread/local.rs:288
#42 std::sys_common::thread_info::set::he3dc668f6080c88a () at libstd/sys_common/thread_info.rs:46
#43 0x000000000040bfa4 in std::rt::lang_start_internal::h60dd5b329f127bc1 () at libstd/rt.rs:51
#44 0x0000000000400304 in main ()

Meta

rustc --version --verbose:

Any nightly Rust since 2018-02-05, installed through rustup. Example:

rustc 1.25.0-nightly (0c6091fbd 2018-02-04)
binary: rustc
commit-hash: 0c6091fbd0eee290c651f73be899f221eeab3c05
commit-date: 2018-02-04
host: x86_64-unknown-linux-gnu
release: 1.25.0-nightly
LLVM version: 4.0

My ~/.cargo/config:

[target.aarch64-unknown-linux-musl]
linker = "aarch64-unknown-linux-musl-gcc"
rustflags = [
  "-C", "link-arg=-lgcc",
  "-C", "target-feature=+crt-static"
]

aarch64-unknown-linux-musl-gcc is from GCC 7.2.0 via Buildroot 2017.08.

Note: I've tried adding -C llvm-args=-fast-isel per #48673 but it made no difference.

@kennytm kennytm added O-Arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state regression-from-stable-to-beta Performance or correctness regression from stable to beta. O-musl Target: The musl libc C-bug Category: This is a bug. labels Mar 12, 2018
@tjkirch
Author

tjkirch commented Mar 12, 2018

I see the libc repo had a similar-sounding issue, but that was back in November 2017, and Rust builds worked for me until February 5, 2018. That earlier fix may have reduced this problem's visibility, though. See: rust-lang/libc#856 and rust-lang/libc@bea4879eec9a1

@pietroalbini pietroalbini added this to the 1.25 milestone Mar 13, 2018
@pietroalbini pietroalbini added I-nominated T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Mar 13, 2018
@tjkirch
Author

tjkirch commented Mar 13, 2018

A colleague found that dynamically linking libgcc (and libc) using "target-feature=-crt-static" works around this issue.

@parched
Contributor

parched commented Mar 14, 2018

I wonder if it's related to #46566

@nikomatsakis
Contributor

nikomatsakis commented Mar 15, 2018

If indeed it worked in 2018-02-04 and failed in 2018-02-05, then the suspicious commit range is 3d292b7 .. 0c6091f (commits), presuming I am interpreting this rustup output correctly

> rustup install nightly-2018-02-04
info: syncing channel updates for 'nightly-2018-02-04-x86_64-unknown-linux-gnu'

  nightly-2018-02-04-x86_64-unknown-linux-gnu unchanged - rustc 1.25.0-nightly (3d292b793 2018-02-03)

> rustup install nightly-2018-02-05
info: syncing channel updates for 'nightly-2018-02-05-x86_64-unknown-linux-gnu'

  nightly-2018-02-05-x86_64-unknown-linux-gnu unchanged - rustc 1.25.0-nightly (0c6091fbd 2018-02-04)

@nikomatsakis
Contributor

@tjkirch can you validate the commits from the two nightlies? That commit range looks a bit suspicious. Just include the -vV output from each rustc that you tested with.

@nikomatsakis
Contributor

triage: P-high

We should figure out what is happening here.

@rust-highfive rust-highfive added P-high High priority and removed I-nominated labels Mar 15, 2018
@nikomatsakis nikomatsakis self-assigned this Mar 15, 2018
@tjkirch
Author

tjkirch commented Mar 15, 2018

@nikomatsakis Sure! Here are the outputs.

Working 2018-02-04 nightly:

$ rustup run nightly-2018-02-04-x86_64-unknown-linux-gnu rustc -vV
rustc 1.25.0-nightly (3d292b793 2018-02-03)
binary: rustc
commit-hash: 3d292b793ade0c1c9098fb32586033d79f6e9969
commit-date: 2018-02-03
host: x86_64-unknown-linux-gnu
release: 1.25.0-nightly
LLVM version: 4.0

Non-working 2018-02-05 nightly:

$ rustup run nightly-2018-02-05-x86_64-unknown-linux-gnu rustc -vV
rustc 1.25.0-nightly (0c6091fbd 2018-02-04)
binary: rustc
commit-hash: 0c6091fbd0eee290c651f73be899f221eeab3c05
commit-date: 2018-02-04
host: x86_64-unknown-linux-gnu
release: 1.25.0-nightly
LLVM version: 4.0

I also bisected with the beta releases using rustup and found that 2018-02-13 (1.24.0-beta.12) works, but the next beta 2018-02-20 (1.25.0-beta.2) crashes.

Last working beta:

$ rustup run beta-2018-02-13 rustc -vV
rustc 1.24.0-beta.12 (ed2c0f084 2018-02-12)
binary: rustc
commit-hash: ed2c0f08442915c628fc855e6a784c5979a4dc83
commit-date: 2018-02-12
host: x86_64-unknown-linux-gnu
release: 1.24.0-beta.12
LLVM version: 4.0

First crashing beta:

$ rustup run beta-2018-02-20 rustc -vV
rustc 1.25.0-beta.2 (1e8fbb143 2018-02-19)
binary: rustc
commit-hash: 1e8fbb1432cc124ba6687c95dc64ed5d21156d6e
commit-date: 2018-02-19
host: x86_64-unknown-linux-gnu
release: 1.25.0-beta.2
LLVM version: 6.0

I also confirmed that the just-released 1.25.0-beta.10 and nightly-2018-03-14 still crash.

My process is basically to rustup toolchain install the version, rustup default the version, rustup target add aarch64-unknown-linux-musl, and cargo build --target aarch64-unknown-linux-musl --release. (I used rustup default because I didn't know target add had a --toolchain argument until just now.)

@tjkirch
Author

tjkirch commented Mar 15, 2018

Now I see what Niko meant about a potentially suspicious commit range... aside from an RLS/rustfmt change from #47991, there were only two merges: #47915 (which frankly I don't understand :)) and #47834 to disable ThinLTO.

@kennytm
Member

kennytm commented Mar 15, 2018

@tjkirch Could you check which one of the three PRs is the cause?

  1. Install the nightly toolchain

  2. Install rustup-toolchain-install-master

    cargo install rustup-toolchain-install-master
  3. Install the build artifacts from these three commits in 3d292b7...0c6091f

    rustup-toolchain-install-master \
        9af374abf9d41c533afa46e62e1047097c190445 \
        3986539df6eb3601cbd4e9c6c195583fca6dc10b \
        0c6091fbd0eee290c651f73be899f221eeab3c05 \
        -t aarch64-unknown-linux-musl
  4. Check for each PR with

    cargo +9af374abf9d41c533afa46e62e1047097c190445 build \
        --target aarch64-unknown-linux-musl --release

@tjkirch
Author

tjkirch commented Mar 15, 2018

@kennytm Sure, thanks for making the tool!

9af374a: works.
3986539: works.
0c6091f: crashes.

That's the ThinLTO change from #47834 fixing #45444.

Please let me know any other information I can provide!

@kennytm
Member

kennytm commented Mar 15, 2018

Thanks! Looks like another trusting-trust issue then 🤷.

@tjkirch
Author

tjkirch commented Mar 15, 2018

I tried with "-Z", "thinlto=no" and with "-Z", "thinlto=yes" in my ~/.cargo/config's rustflags, but it crashed either way.

@nikomatsakis
Contributor

Fascinating. That was our guess from the compiler team meeting, though it seemed unlikely as that PR ought to improve reliability in general.

@nikomatsakis
Contributor

I could use any suggestions for how to reproduce this problem =) Some have suggested qemu?

That said, I'm not sure where to start debugging this. Seems...likely, possible?...to be an LLVM problem? I'm sort of hoping that one of the LLVM upgrades will make it go away. =)

In any case, reproducing it would be a start.

@tjkirch
Author

tjkirch commented Mar 16, 2018

Yeah, I think qemu-aarch64 is the way to go, and perhaps the "bleeding edge" aarch64/musl toolchain from https://toolchains.bootlin.com/ -- I'll see if I can build a repro using those tools.

@tjkirch
Author

tjkirch commented Mar 16, 2018

@nikomatsakis Here's how I reproduced from scratch:

  • I used a Fedora 27 EC2 instance, though anything recent with qemu should be fine - https://alt.fedoraproject.org/cloud/
  • sudo yum -y install qemu-user
  • I downloaded the aarch64/musl "bleeding edge" toolchain from https://toolchains.bootlin.com/ and extracted it in the home directory, to get the linker:
    • curl -O https://toolchains.bootlin.com/downloads/releases/toolchains/aarch64/tarballs/aarch64--musl--bleeding-edge-2018.02-1.tar.bz2
    • tar xf aarch64--musl--bleeding-edge-2018.02-1.tar.bz2
  • I set ~/.cargo/config to:
[target.aarch64-unknown-linux-musl]
linker = "/home/fedora/aarch64--musl--bleeding-edge-2018.02-1/bin/aarch64-buildroot-linux-musl-gcc"
rustflags = [
  "-C", "link-arg=-lgcc",
  "-C", "target-feature=+crt-static",
]
  • Install rustup using curl https://sh.rustup.rs -sSf | sh && source $HOME/.cargo/env
  • Install toolchains and targets:
    • rustup toolchain install nightly-2018-02-04 nightly-2018-02-05
    • rustup target add aarch64-unknown-linux-musl --toolchain nightly-2018-02-04
    • rustup target add aarch64-unknown-linux-musl --toolchain nightly-2018-02-05
  • Hello world: cargo new --bin testme && cd testme
  • Observe success:
    • cargo +nightly-2018-02-04 build --target aarch64-unknown-linux-musl --release
    • qemu-aarch64 target/aarch64-unknown-linux-musl/release/testme
  • Observe crash:
    • cargo +nightly-2018-02-05 build --target aarch64-unknown-linux-musl --release
    • qemu-aarch64 target/aarch64-unknown-linux-musl/release/testme

@parched
Contributor

parched commented Mar 17, 2018

This appears to be fixed again on nightly-2018-03-16, I guess because of #48892?

@tjkirch
Author

tjkirch commented Mar 18, 2018

I can confirm it started working for me as well in the latest nightly! This is with nightly-2018-03-17; nightly-2018-03-16 still crashed. To be specific:

Crashes:

$ rustup run nightly-2018-03-16 rustc -vV
rustc 1.26.0-nightly (392645394 2018-03-15)
binary: rustc
commit-hash: 39264539448e7ec5e98067859db71685393a4464
commit-date: 2018-03-15
host: x86_64-unknown-linux-gnu
release: 1.26.0-nightly
LLVM version: 6.0

Works!

$ rustup run nightly-2018-03-17 rustc -vV
rustc 1.26.0-nightly (55c984ee5 2018-03-16)
binary: rustc
commit-hash: 55c984ee5db73db2379024951457d1139db57f24
commit-date: 2018-03-16
host: x86_64-unknown-linux-gnu
release: 1.26.0-nightly
LLVM version: 6.0

I'll work on bisecting to pinpoint the merge that fixed it. Don't want this to reoccur!

@tjkirch
Author

tjkirch commented Mar 18, 2018

Unfortunately rustup-toolchain-install-master wasn't able to fetch artifacts for all of the intervening commits; I'm guessing artifacts either aren't uploaded for every build, aren't uploaded yet, or some commits were built together. It seems to work with all of the commits below, but I'm not confident without being able to test every one and confirm the negative case.

55c984e
3b6412b
cc34ca1
5f3996c
a7170b0
36b6687

@nikomatsakis
Contributor

So it seems clear that I am not the best person to investigate this, and I've also got a few other things to look into right now, so I don't really have time. I'm not sure who is, but we've got a lot of data collected. So the plan is that I (or someone else) will summarize the current state:

  • This is how to reproduce:
    • it works with this rustc, fails with that one
    • what commits are in that range

and we'll put out a call and see if anyone can help us narrow down the problem. I suspect some kind of LLVM bug here still and it'd be nice to know what to pin it on.

@Amanieu
Member

Amanieu commented Apr 5, 2018

Looking at the backtrace, it seems that this issue is related to thread-local storage. It seems that this function is causing a segfault:

#35 std::sys_common::thread_info::current_thread::h1ca059562bf90a53 () at libstd/sys_common/thread_info.rs:38

I will have a look at the disassembly & relocations generated for that TLS access, I think something strange might be happening at LTO/link time.

@Amanieu
Member

Amanieu commented Apr 5, 2018

OK, so I think that I've found the source of the bug:

  402af0:	d2a00000 	movz	x0, #0x0, lsl #16
  402af4:	f2800400 	movk	x0, #0x20 <-- THIS SHOULD BE 0x10
  402af8:	d503201f 	nop
  402afc:	d503201f 	nop
  402b00:	d53bd048 	mrs	x8, tpidr_el0
  402b04:	8b000108 	add	x8, x8, x0
  402b08:	f9400d08 	ldr	x8, [x8, #24]

For some reason, the addresses of all TLS variables are offset by an additional 0x10. This behavior happens in nightly-2018-02-05 (broken) but not in nightly-2018-02-04 (good).

I think this may have gone unnoticed in the past since all TLS was shifted by 0x10, and the TLS was zero-initialized. In this specific case, one of the bytes of the TLS data has an initial value of 0x3, but due to the 0x10 shift it is accessed at the wrong offset by the program.

@Amanieu
Member

Amanieu commented Apr 5, 2018

Now I'm completely stumped as to what actually caused this bug. It seems like a bug in LLVM rather than rustc, possibly related to LTO since the linker is getting confused about TLS offsets.

@pnkfelix
Member

visited for triage. It seems we haven't made progress since the last report. I am wondering whether we can enlist someone to act as a local "LLVM LTO bug identification" expert...

@kennytm kennytm added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-thread-locals Area: Thread local storage (TLS) labels Apr 19, 2018
@nikomatsakis
Copy link
Contributor

@Amanieu can we confirm that it is an LTO problem?

@Amanieu
Member

Amanieu commented Apr 26, 2018

To me it looks like a linker bug: the TLS relocations are being resolved to the wrong value. Since it is very unlikely that the linker has been broken this whole time, I would blame it on LTO somehow interfering with the linker.

@nikomatsakis
Contributor

@Amanieu -- question: do you think you can narrow this down to just LLVM IR inputs that reflect the error, so we can open a bug on the LLVM side?

@nikomatsakis
Contributor

triage: P-medium

Next steps are to diagnose the LLVM problem. Filing under #50422.

@nikomatsakis nikomatsakis removed their assignment May 3, 2018
@nikomatsakis nikomatsakis added P-medium Medium priority and removed P-high High priority labels May 3, 2018
@Amanieu
Member

Amanieu commented May 3, 2018

I have a minimal reproduction:

#![feature(libc, thread_local, asm)]
#![no_main]
extern crate libc;

#[thread_local]
static mut ASDF: u8 = 74;

#[inline(never)]
fn get_tls_val() -> i32 {
    // The asm here is just to prevent the TLS access from being optimized away
    unsafe {
        let out: &u8;
        asm!("" : "=r" (out) : "0" (&ASDF));
        *out as i32
    }
}


#[no_mangle]
pub unsafe extern fn main() -> i32 {
    let val = get_tls_val();
    libc::printf(b"%d\n\0".as_ptr(), val);

    // UNCOMMENT THIS LINE TO TRIGGER THE BUG
    //std::thread::sleep_ms(1);

    0
}

The bug only seems to trigger when libstd is linked into the final binary. The expected output is 74, which is the initial value of the TLS variable. However when libstd is linked in, the output is 0 because the TLS offsets are incorrect.

Bad version:

0000000000400268 <hello::get_tls_val>:
  400268:       d10043ff        sub     sp, sp, #0x10
  40026c:       d53bd048        mrs     x8, tpidr_el0
  400270:       91400108        add     x8, x8, #0x0, lsl #12
  400274:       91008108        add     x8, x8, #0x20 <-----
  400278:       f90007e8        str     x8, [sp, #8]
  40027c:       f94007e8        ldr     x8, [sp, #8]
  400280:       39400100        ldrb    w0, [x8]
  400284:       910043ff        add     sp, sp, #0x10
  400288:       d65f03c0        ret

Good version:

0000000000400268 <hello::get_tls_val>:
  400268:       d10043ff        sub     sp, sp, #0x10
  40026c:       d53bd048        mrs     x8, tpidr_el0
  400270:       91400108        add     x8, x8, #0x0, lsl #12
  400274:       91004108        add     x8, x8, #0x10 <-----
  400278:       f90007e8        str     x8, [sp, #8]
  40027c:       f94007e8        ldr     x8, [sp, #8]
  400280:       39400100        ldrb    w0, [x8]
  400284:       910043ff        add     sp, sp, #0x10
  400288:       d65f03c0        ret

@Amanieu
Member

Amanieu commented May 3, 2018

Switching the linker between bfd, gold and lld doesn't seem to make any difference.

@tjkirch
Author

tjkirch commented Jun 18, 2018

There's a TLS-related fix in musl that applies to aarch64 and some other architectures; it will probably be in the 1.1.20 release. I wonder if it helps with this!

https://git.musl-libc.org/cgit/musl/commit/?id=610c5a8524c3d6cd3ac5a5f1231422e7648a3791

@Amanieu
Member

Amanieu commented Jun 18, 2018

That might very well be the solution! I noticed that the .tdata section in libstd was aligned to 32 bytes, which may very well explain why it's not being handled correctly.

So basically, my earlier hypothesis is incorrect: the compiler/linker are calculating the TLS offsets correctly, it's just that musl isn't handling over-aligned TLS sections correctly.

@agend

agend commented Jun 19, 2018

When is a new nightly build with this fix expected to appear? I'm trying to use xargo to make a fresh build myself, but I keep running into compilation issues.

@bcressey
Contributor

@Amanieu - I've confirmed that your minimal reproduction prints 74 when musl has 610c5a8 applied, and 0 otherwise.

@agend

agend commented Jun 26, 2018

I have built my project with Rust compiled from source against a fresh musl, and can confirm it works.

@malbarbo
Contributor

malbarbo commented Jul 5, 2018

@Amanieu @agend

Are you cross-compiling? I'm trying to cross-compile to test the TLS fix, but I get the error from #46651 and rust-lang/compiler-builtins#201.

@Amanieu
Member

Amanieu commented Jul 5, 2018

@malbarbo

Use this command as a workaround:

cargo rustc --target aarch64-unknown-linux-musl -- -C link-arg=-lgcc

bors added a commit that referenced this issue Jul 7, 2018
Update musl to 1.1.19 and add patch to fix tls issue

This fixes #48967
Mark-Simulacrum added a commit to Mark-Simulacrum/rust that referenced this issue Jul 7, 2018
Update musl to 1.1.19 and add patch to fix tls issue

This fixes rust-lang#48967