Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

capturing stack backtrace becomes slower and sometimes segfaults on Apple Silicon #104388

Open
Tracked by #6103
skyzh opened this issue Nov 14, 2022 · 17 comments
Open
Tracked by #6103
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-runtime Area: std's runtime and "pre-main" init for handling backtraces, unwinds, stack overflows C-bug Category: This is a bug. I-crash Issue: The compiler crashes (SIGSEGV, SIGABRT, etc). Use I-ICE instead when the compiler panics. O-AArch64 Armv8-A or later processors in AArch64 mode O-macos Operating system: macOS T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@skyzh
Copy link
Contributor

skyzh commented Nov 14, 2022

Sorry for not having a MCVE for this issue. I'm still constructing it. For now I have to refer to a large codebase to reproduce the issue.

Basically, after we upgrade from nightly-2022-07-29 to nightly-2022-10-16, we found two things:

This issue is stably reproducible on some specific commits of our project with some specific way of compiling, so I guess it's probably not related to incremental compile. I guess it would be more likely to be a problem with the LLVM 15 upgrade in August or the std::backtrace::Backtrace stabilization.

Reproduce 1

On this commit: risingwavelabs/risingwave@227e9e5

RUST_BACKTRACE=1 cargo run --bin risingwave -- playground

In another terminal, use psql (Postgres's client) to connect to the program:

psql -h localhost -p 4566 -d dev -U root
CREATE TABLE t(a int, b int);
CREATE VIEW v AS SELECT * FROM t;
DROP TABLE t;

The program will immediately segfault in Backtrace::capture.

Interestingly, if we use cargo build -p risingwave_cmd_all && ./target/debug/risingwave playground, it works. The commit following the buggy commit risingwavelabs/risingwave@604a0a5 also magically resolves the issue with some random code change.

Reproduce 2

On this commit: risingwavelabs/risingwave@484b9ab

cargo build -p risingwave_cmd
RUST_BACKTRACE=1 ./target/debug/meta-node # in terminal 1
RUST_BACKTRACE=1 ./target/debug/compute-node # in terminal 2
RUST_BACKTRACE=1 ./target/debug/frontend # in terminal 3

In another terminal:

psql -h localhost -p 4566 -d dev -U root
CREATE TABLE BOOLTBL2 (f1 bool); INSERT INTO BOOLTBL2 (f1) VALUES (bool 'XXX');

compute-node will also immediately segfault when capturing a backtrace.

Thanks for investigating into this!

Meta

rustc --version --verbose:

rustc 1.66.0-nightly (b8c35ca26 2022-10-15)
binary: rustc
commit-hash: b8c35ca26b191bb9a9ac669a4b3f4d3d52d97fb1
commit-date: 2022-10-15
host: aarch64-apple-darwin
release: 1.66.0-nightly
LLVM version: 15.0.2
Backtrace

<backtrace>

@skyzh
Copy link
Contributor Author

skyzh commented Nov 14, 2022

@rustbot label +A-LLVM +T-compiler

@rustbot rustbot added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Nov 14, 2022
@skyzh
Copy link
Contributor Author

skyzh commented Nov 14, 2022

We also observed that DWARF region increased significantly between the toolchain update:

The suspicious thing

is the dwarf region in the binary.

On 2022-07-29:

objdump target/debug/risingwave  --dwarf=frame | wc -l
     371

On 2022-10-16:

objdump target/debug/risingwave  --dwarf=frame | wc -l
 14671494

Which may relate to the slower backtrace, but I still have no idea why it will segfault sometimes...

@skyzh
Copy link
Contributor Author

skyzh commented Nov 14, 2022

I also had an idea that it may be related to the debuginfo=unpacked stabilization, but it turned out that debuginfo is always unpacked on macOS across the two toolchains? :(

@skyzh
Copy link
Contributor Author

skyzh commented Nov 14, 2022

segfault backtrace:

* thread #10, name = 'risingwave-main', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x00000001944f6770 libunwind.dylib`libunwind::CFI_Parser<libunwind::LocalAddressSpace>::parseFDEInstructions(libunwind::LocalAddressSpace&, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::FDE_Info const&, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::CIE_Info const&, unsigned long, int, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::PrologInfo*) + 204
libunwind.dylib`libunwind::CFI_Parser<libunwind::LocalAddressSpace>::parseFDEInstructions:
->  0x1944f6770 <+204>: ldrb   w8, [x28], #0x1
    0x1944f6774 <+208>: stur   x28, [x29, #-0x98]
    0x1944f6778 <+212>: cmp    w8, #0x2f
    0x1944f677c <+216>: b.hi   0x1944f72c4               ; <+3104>
Target 0: (risingwave) stopped.
(lldb) bt
* thread #10, name = 'risingwave-main', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x00000001944f6770 libunwind.dylib`libunwind::CFI_Parser<libunwind::LocalAddressSpace>::parseFDEInstructions(libunwind::LocalAddressSpace&, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::FDE_Info const&, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::CIE_Info const&, unsigned long, int, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::PrologInfo*) + 204
    frame #1: 0x00000001944f6624 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::getInfoFromFdeCie(libunwind::CFI_Parser<libunwind::LocalAddressSpace>::FDE_Info const&, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::CIE_Info const&, unsigned long, unsigned long) + 100
    frame #2: 0x00000001944f62fc libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::getInfoFromDwarfSection(unsigned long, libunwind::UnwindInfoSections const&, unsigned int) + 184
    frame #3: 0x00000001944f6220 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::setInfoBasedOnIPRegister(bool) + 1228
    frame #4: 0x00000001944f86b0 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::step() + 696
    frame #5: 0x00000001944fb0f0 libunwind.dylib`_Unwind_Backtrace + 348
    frame #6: 0x00000001090822cc risingwave`std::backtrace::Backtrace::create::h908375f7f84cb508 [inlined] std::backtrace_rs::backtrace::libunwind::trace::h471a59e08ff9e5dc at mod.rs:66:5 [opt]
    frame #7: 0x00000001090822bc risingwave`std::backtrace::Backtrace::create::h908375f7f84cb508 [inlined] std::backtrace_rs::backtrace::trace_unsynchronized::h4e694232d85e2708 at mod.rs:66:5 [opt]
    frame #8: 0x00000001090822b0 risingwave`std::backtrace::Backtrace::create::h908375f7f84cb508 at backtrace.rs:333:13 [opt]
    frame #9: 0x00000001056e7b5c risingwave`_$LT$risingwave_meta..error..MetaError$u20$as$u20$core..convert..From$LT$risingwave_meta..error..MetaErrorInner$GT$$GT$::from::hb4b62fbc8685e728(inner=<unavailable>) at error.rs:69:18
    frame #10: 0x0000000105e86adc risingwave`_$LT$T$u20$as$u20$core..convert..Into$LT$U$GT$$GT$::into::h7837bc8fb77e8181(self=<unavailable>) at mod.rs:726:9
    frame #11: 0x00000001056e7fe4 risingwave`risingwave_meta::error::MetaError::permission_denied::h6592a4f64415a283(s=<unavailable>) at error.rs:103:9

@skyzh
Copy link
Contributor Author

skyzh commented Nov 14, 2022

The latest nightly (2022-11-14) will still produce this segfault. It's somehow very easy to reproduce over the codebase, but some simple changes in the codebase (e.g., add a println before capturing backtrace) will make it magically work again.

Maybe related: #47551?

@skyzh
Copy link
Contributor Author

skyzh commented Nov 14, 2022

@rustbot label +A-LLVM +I-crash

@rustbot rustbot added the I-crash Issue: The compiler crashes (SIGSEGV, SIGABRT, etc). Use I-ICE instead when the compiler panics. label Nov 14, 2022
@skyzh
Copy link
Contributor Author

skyzh commented Nov 14, 2022

@skyzh
Copy link
Contributor Author

skyzh commented Nov 15, 2022

Can be workaround by using rust-lld or LLVM 15 lld (brew install llvm@15)

rustflags = [
  "-Clink-arg=-fuse-ld=/opt/homebrew/opt/llvm@15/bin/ld64.lld",
]

So it looks like a bug with macOS's bundled linker?

$ cc --version
Apple clang version 14.0.0 (clang-1400.0.29.202)
Target: arm64-apple-darwin22.1.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

@skyzh skyzh changed the title capturing stack backtrace becomes slower and sometimes segfaults capturing stack backtrace becomes slower and sometimes segfaults on Apple Silicon Nov 15, 2022
@skyzh
Copy link
Contributor Author

skyzh commented Nov 15, 2022

@xxchan
Copy link
Contributor

xxchan commented Apr 8, 2023

Guess it's due to e2b52ff

@xxchan
Copy link
Contributor

xxchan commented Apr 9, 2023

The segfault is a NPE EXC_BAD_ACCESS (code=1, address=0x0) in parseFDEInstructions, which looks different from #47551

@xxchan
Copy link
Contributor

xxchan commented Apr 10, 2023

@xxchan
Copy link
Contributor

xxchan commented Apr 10, 2023

Can be workaround by using rust-lld or LLVM 15 lld (brew install llvm@15)

This is wrong. risingwavelabs/risingwave@5e0b9d8 with homebrew's llvm15's lld segfaults with the same reason.

@xxchan
Copy link
Contributor

xxchan commented Apr 11, 2023

I also had an idea that it may be related to the debuginfo=unpacked stabilization, but it turned out that debuginfo is always unpacked on macOS across the two toolchains? :(

Besides, I tested all 3 settings of split-debuginfo on Linux/mac. Both have similar performance: 100+ms on mac and 100+µs on linux

@workingjubilee workingjubilee added A-runtime Area: std's runtime and "pre-main" init for handling backtraces, unwinds, stack overflows O-AArch64 Armv8-A or later processors in AArch64 mode O-macos Operating system: macOS labels Jul 11, 2023
@IvanGoncharov
Copy link

IvanGoncharov commented Jun 11, 2024

I experience exactly the same issue in apollographql/router#5358
Our CI crashed during a test on MacOS, and segfault backtrace extracted from core file looks like this:

* thread #3, stop reason = ESR_EC_DABORT_EL0 (fault address: 0x0)
  * frame #0: 0x00000001b712785c libunwind.dylib`libunwind::CFI_Parser<libunwind::LocalAddressSpace>::parseFDEInstructions(libunwind::LocalAddressSpace&, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::FDE_Info const&, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::CIE_Info const&, unsigned long, int, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::PrologInfo*) + 204
    frame #1: 0x00000001b7127710 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::getInfoFromFdeCie(libunwind::CFI_Parser<libunwind::LocalAddressSpace>::FDE_Info const&, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::CIE_Info const&, unsigned long, unsigned long) + 100
    frame #2: 0x00000001b71273e8 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::getInfoFromDwarfSection(unsigned long, libunwind::UnwindInfoSections const&, unsigned int) + 184
    frame #3: 0x00000001b71272a0 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::setInfoBasedOnIPRegister(bool) + 1012
    frame #4: 0x00000001b7129788 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::step() + 696
    frame #5: 0x00000001b712c138 libunwind.dylib`_Unwind_Backtrace + 352
    frame #6: 0x0000000107f5eaf4 apollo_router-af108af4cec596d1`std::backtrace::Backtrace::create::h8af6f7a25d8a3749 [inlined] std::backtrace_rs::backtrace::libunwind::trace::hadefa2fe489b32be at libunwind.rs:104:5 [opt]
    frame #7: 0x0000000107f5eae4 apollo_router-af108af4cec596d1`std::backtrace::Backtrace::create::h8af6f7a25d8a3749 [inlined] std::backtrace_rs::backtrace::trace_unsynchronized::hcdcd1d1cfc2eb045 at mod.rs:66:5 [opt]
    frame #8: 0x0000000107f5ead8 apollo_router-af108af4cec596d1`std::backtrace::Backtrace::create::h8af6f7a25d8a3749 at backtrace.rs:331:13 [opt]
    frame #9: 0x0000000105093130 apollo_router-af108af4cec596d1`_$LT$apollo_federation..error..FederationError$u20$as$u20$core..convert..From$LT$apollo_federation..error..SingleFederationError$GT$$GT$::from::h82f039bb6390e43a(inner=<unavailable>) at mod.rs:508:20
    frame #10: 0x0000000104ec385c apollo_router-af108af4cec596d1`_$LT$core..result..Result$LT$T$C$F$GT$$u20$as$u20$core..ops..try_trait..FromResidual$LT$core..result..Result$LT$core..convert..Infallible$C$E$GT$$GT$$GT$::from_residual::h4be605c22a0be363(residual=<unavailable>) at result.rs:1959:27
    frame #11: 0x0000000104f93a30 apollo_router-af108af4cec596d1`apollo_federation::link::federation_spec_definition::get_federation_spec_definition_from_subgraph::hd30c64632e80bc80(schema=0x000000016fe775c0) at federation_spec_definition.rs:413:27
    frame #12: 0x00000001050ce938 apollo_router-af108af4cec596d1`apollo_federation::schema::compute_subgraph_metadata::hb4be50d87da3ab88(schema=0x000000016fe775c0) at mod.rs:40:49

@workingjubilee
Copy link
Member

huuuh, that is very weird if it's faulting on a null pointer inside libunwind...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-runtime Area: std's runtime and "pre-main" init for handling backtraces, unwinds, stack overflows C-bug Category: This is a bug. I-crash Issue: The compiler crashes (SIGSEGV, SIGABRT, etc). Use I-ICE instead when the compiler panics. O-AArch64 Armv8-A or later processors in AArch64 mode O-macos Operating system: macOS T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

5 participants