valgrind: Mark ppc64 w/ ELFv2 as unsupported #295906

Open
OPNA2608 wants to merge 1 commit into NixOS:staging from OPNA2608:ppc64/valgrind

@OPNA2608
Contributor

OPNA2608 commented Mar 14, 2024

Description of changes

  • Tests are still failing due to ELFv1 hardcoding
  • We can build for ppc64 ELFv2, but running the resulting valgrind binary on anything Nixpkgs produces results in
    • a SIGSEGV with glibc
      • many reports of uninitialised value usage inside glibc
      • the process terminating on a SIGSEGV with "Bad permissions for mapped region" inside glibc
      • valgrind itself crashing with a SIGSEGV in do_syscall_WRK
    • a slightly more graceful crash of valgrind with musl
      • similar reports of uninitialised value usage inside musl
      • valgrind exits upon encountering "the impossible"

Details on debugging

Here is a full log of valgrind <hello>/bin/hello (both ELFv2). It contains many "Conditional jump or move depends on uninitialised value(s)" and "Use of uninitialised value of size" reports, and ends in:

==215556== Process terminating with default action of signal 11 (SIGSEGV)
==215556==  Bad permissions for mapped region at address 0x4067F00
==215556==    at 0x4067F00: ??? (in /nix/store/13mhdf5x574x9wnrjpdk79k8ir52fw5n-glibc-2.38-44/lib/ld64.so.2)
==215556==    by 0x401E41B: _dl_relocate_object (in /nix/store/13mhdf5x574x9wnrjpdk79k8ir52fw5n-glibc-2.38-44/lib/ld64.so.2)
==215556==    by 0x403492B: dl_main (in /nix/store/13mhdf5x574x9wnrjpdk79k8ir52fw5n-glibc-2.38-44/lib/ld64.so.2)
==215556==    by 0x403016F: _dl_sysdep_start (in /nix/store/13mhdf5x574x9wnrjpdk79k8ir52fw5n-glibc-2.38-44/lib/ld64.so.2)
==215556==    by 0x4031957: _dl_start_final (in /nix/store/13mhdf5x574x9wnrjpdk79k8ir52fw5n-glibc-2.38-44/lib/ld64.so.2)
==215556==    by 0x403207B: _dl_start (in /nix/store/13mhdf5x574x9wnrjpdk79k8ir52fw5n-glibc-2.38-44/lib/ld64.so.2)
==215556==    by 0x4030C37: (below main) (in /nix/store/13mhdf5x574x9wnrjpdk79k8ir52fw5n-glibc-2.38-44/lib/ld64.so.2)
==215556== 
==215556== HEAP SUMMARY:
==215556==     in use at exit: 0 bytes in 0 blocks
==215556==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==215556== 
==215556== All heap blocks were freed -- no leaks are possible
==215556== 
==215556== Use --track-origins=yes to see where uninitialised values come from
==215556== For lists of detected and suppressed errors, rerun with: -s
==215556== ERROR SUMMARY: 4055 errors from 341 contexts (suppressed: 73 from 3)
fish: Job 1, '/nix/store/0ym9av408mv66ycgwlih…' terminated by signal SIGSEGV (Segmentation fault)

This works fine on ELFv1 (ELFv1 valgrind, on ELFv1 hello):

==217386== Memcheck, a memory error detector
==217386== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==217386== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==217386== Command: /nix/store/g2q9l4c7al2yzg3l4hhv7j1sl1d03mcc-hello-powerpc64-unknown-linux-gnuabielfv1-2.12.1/bin/hello
==217386== 
Hello, world!
==217386== 
==217386== HEAP SUMMARY:
==217386==     in use at exit: 0 bytes in 0 blocks
==217386==   total heap usage: 6 allocs, 6 frees, 5,318 bytes allocated
==217386== 
==217386== All heap blocks were freed -- no leaks are possible
==217386== 
==217386== For lists of detected and suppressed errors, rerun with: -s
==217386== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

So I've arrived at the conclusion that this just doesn't work.

Adding this platform to badPlatforms lets us use lib.meta.availableOn in other derivations (e.g. libdrm, which already uses it).


Draft, because I still wanna try installing Adélie Linux on my hardware and give the valgrind there a test (which is where this patch seems to originate from). If it works there, then I'll try to dig deeper into this.
Edit: See follow-up comments for results of that.

CC @alyssais, did the resulting binary work when you added the patch in #213341? I tried locally undoing just the valgrind bumps since then, but the binary still completely fails.

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • powerpc64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.05 Release Notes (or backporting 23.05 and 23.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.


OPNA2608 changed the base branch from master to staging March 14, 2024 14:59
ofborg bot requested a review from edolstra March 14, 2024 15:22
ofborg bot added the 10.rebuild-darwin: 101-500, 10.rebuild-linux: 501+, and 10.rebuild-linux: 5001+ labels Mar 14, 2024
@OPNA2608
Contributor Author

> Draft, because I still wanna try installing Adélie Linux on my hardware and give the valgrind there a test (which is where this patch seems to originate from). If it works there, then I'll try to dig deeper into this.

puna@HC ~> grep PRETTY_NAME /etc/os-release
PRETTY_NAME="Adélie Linux 1.0 (Beta 5)"
puna@HC ~> file ./test
./test: ELF 64-bit MSB pie executable, 64-bit PowerPC or cisco 7500, OpenPOWER ELF V2 ABI, version 1 (SYSV), dynamically linked, interpreter /lib/ld-musl-powerpc64.so.1, with debug_info, not stripped
puna@HC ~> valgrind -q ./test
Hello world!
puna@HC ~> echo $status
0

Dig deeper, we shall do…

@OPNA2608
Contributor Author

Okay… Adélie's valgrind works with binaries produced by Adélie's GCC+musl toolchain. As does our valgrind with binaries from Adélie's toolchain. But Adélie's valgrind dies in the same manner as reported above with binaries from our GCC toolchain (glibc & musl, doesn't matter). So I guess something about our toolchain is breaking this?

I'll rework this to instead just mark the tests as unsupported, since they're definitely broken in this config - the ELFv2 patch comes from Adélie, and they don't run the tests there. Maybe we can also fetch another ppc64 patch from them, though I haven't tested if it works on glibc yet.

@OPNA2608
Contributor Author

Testing with hello:

  • Our glibc: Weird traces, SIGSEGV
  • Our musl: Weird traces, Graceful error about Valgrind hitting an impossible situation it doesn't know how to handle
  • Adélie's musl (patchelf'd in): No issues at all

Also tried:

  • Applying all of Adélie's musl patches: No change
  • Using the same GCC version as Adélie: No change
  • Using Clang: No change

The graceful error:

MC_(get_otrack_shadow_offset)(ppc64)(off=44,sz=4)

Memcheck: mc_machine.c:403 (get_otrack_shadow_offset_wrk): the 'impossible' happened.

host stacktrace:
==4118==    at 0x58053514: show_sched_status_wrk (m_libcassert.c:406)
==4118==    by 0x580536E7: report_and_quit (m_libcassert.c:477)
==4118==    by 0x58053887: vgPlain_assert_fail (m_libcassert.c:543)
==4118==    by 0x5804285F: get_otrack_shadow_offset_wrk (mc_machine.c:403)
==4118==    by 0x5804285F: vgMemCheck_get_otrack_shadow_offset (mc_machine.c:97)
==4118==    by 0x58008BBB: mb_get_origin_for_guest_offset (mc_main.c:4618)
==4118==    by 0x58008BBB: mc_pre_reg_read (mc_main.c:4685)
==4118==    by 0x580E2BAB: vgSysWrap_generic_sys_read_before (syswrap-generic.c:4261)
==4118==    by 0x580D0E7B: vgPlain_client_syscall (syswrap-main.c:2240)
==4118==    by 0x580CAE37: handle_syscall (scheduler.c:1206)
==4118==    by 0x580CDFDB: vgPlain_scheduler (scheduler.c:1552)
==4118==    by 0x5812011B: thread_wrapper (syswrap-linux.c:102)
==4118==    by 0x5812011B: run_a_thread_NORETURN (syswrap-linux.c:155)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable syscall 3 (lwpid 4118)
==4118==    at 0x407B2C0: ??? (in /nix/store/5837cqfasvlj1xzdwd4f47fgi3c6cszg-musl-powerpc64-unknown-linux-musl-1.2.3/lib/libc.so)
==4118==    by 0x4069FC7: __syscall_cp (in /nix/store/5837cqfasvlj1xzdwd4f47fgi3c6cszg-musl-powerpc64-unknown-linux-musl-1.2.3/lib/libc.so)
==4118==    by 0x407590F: read (in /nix/store/5837cqfasvlj1xzdwd4f47fgi3c6cszg-musl-powerpc64-unknown-linux-musl-1.2.3/lib/libc.so)
==4118==    by 0x40780CB: map_library (in /nix/store/5837cqfasvlj1xzdwd4f47fgi3c6cszg-musl-powerpc64-unknown-linux-musl-1.2.3/lib/libc.so)
==4118==    by 0x4078CFF: load_library (in /nix/store/5837cqfasvlj1xzdwd4f47fgi3c6cszg-musl-powerpc64-unknown-linux-musl-1.2.3/lib/libc.so)
==4118==    by 0x4079F93: __dls3 (in /nix/store/5837cqfasvlj1xzdwd4f47fgi3c6cszg-musl-powerpc64-unknown-linux-musl-1.2.3/lib/libc.so)
==4118==    by 0x407969B: __dls2b (in /nix/store/5837cqfasvlj1xzdwd4f47fgi3c6cszg-musl-powerpc64-unknown-linux-musl-1.2.3/lib/libc.so)
==4118==    by 0x40795A7: __dls2 (in /nix/store/5837cqfasvlj1xzdwd4f47fgi3c6cszg-musl-powerpc64-unknown-linux-musl-1.2.3/lib/libc.so)
==4118==    by 0x4076AF3: _dlstart_c (in /nix/store/5837cqfasvlj1xzdwd4f47fgi3c6cszg-musl-powerpc64-unknown-linux-musl-1.2.3/lib/libc.so)
==4118==    by 0x407B313: ??? (in /nix/store/5837cqfasvlj1xzdwd4f47fgi3c6cszg-musl-powerpc64-unknown-linux-musl-1.2.3/lib/libc.so)
client stack range: [0x1FFEFFF000 0x1FFF000FFF] client SP: 0x1FFEFFFA50
valgrind stack range: [0x1002CC4000 0x1002DC3FFF] top usage: 6576 of 1048576

Zero experience with Valgrind internals or POWER register details. AFAICT it's trying to identify what register that offset & size maps to, and comes up empty. 64-bit POWER ELF V2 specs say offset 44 is in the middle of an 8-byte general-purpose register, so bad offset & size. It would match GPR12 on 32-bit PowerPC… but I'm not sure how we could arrive there? Also, there's still the issue of all the memory error traces inside the libcs before this.
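The alignment argument above can be checked with a little arithmetic. The sketch below assumes, purely for illustration, that general-purpose registers are stored back to back in fixed-width slots starting at a slot-aligned base. The function name and this layout are hypothetical and are not taken from Valgrind's actual guest-state definition.

```python
def slot_position(offset: int, size: int, slot_width: int) -> tuple[int, bool]:
    """Locate an access inside fixed-width register slots.

    Assumption: slots are contiguous and the first one starts at a
    slot-aligned offset (hypothetical layout, not Valgrind's real one).
    Returns (byte position inside the slot, whether the access covers
    exactly one whole slot).
    """
    within = offset % slot_width
    covers_whole_slot = within == 0 and size == slot_width
    return within, covers_whole_slot


# Values from the failing assertion: off=44, sz=4.
print(slot_position(44, 4, 8))  # 8-byte slots, as for 64-bit GPRs
print(slot_position(44, 4, 4))  # 4-byte slots, as for 32-bit GPRs
```

Under this assumption the access starts 4 bytes into an 8-byte slot, but lines up exactly with a 4-byte slot, which is consistent with the observation that the offset and size look like a 32-bit register access.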

I'm really not sure what to make of all of this… Considering that, for all intents and purposes, Nixpkgs' valgrind just doesn't seem to work for ELF V2 libcs from Nixpkgs, maybe marking it unsupported does make the most sense here.

@OPNA2608
Contributor Author

OPNA2608 commented Apr 29, 2024

I don't have the capacity / willpower to debug this much more for now, let's just try marking this as unsupported. I'm also switching the ELFv2 patch from the Void version to the Adélie one, just to track the upstream of this more closely.

OPNA2608 marked this pull request as ready for review April 29, 2024 15:50
OPNA2608 added the 6.topic: exotic label Apr 29, 2024
OPNA2608 force-pushed the ppc64/valgrind branch 2 times, most recently from 3162b09 to c140563 April 29, 2024 20:57
@OPNA2608
Contributor Author

This is a long shot (and apologies for the random ping, feel free to disregard if you're busy) but @awilfox, do you maybe have any input or advice on this? I'm not super knowledgeable about how to debug this… Maybe our compiler is slightly misconfigured for your valgrind patch?

@awilfox

awilfox commented May 16, 2024

I tried, but find digging through the gcc packaging of nix quite exhausting. Everyone I know that used to use nix has switched away, so this is rough. However, I would be quite surprised if it wasn't something related to the toolchain.

  • I didn't see --enable-secureplt anywhere, but musl itself shouldn't work without it, so I don't think that's it.
  • I didn't see --enable-decimal-float=no, but again, I believe that's a default and musl wouldn't work without that being set correctly.
  • We enable relro/now by default, so that could be something different.

Perhaps you could compile a very small hello world binary using our toolchain and yours, and check output of objdump on both, or even a binary dump if you can make it small enough, to check the difference.
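One way to carry out the comparison suggested above is to run the same inspection tools on both binaries and diff the text output. The following is a minimal sketch, assuming `readelf` and `objdump` from binutils are on `PATH`; the helper names and the file names in the usage note are hypothetical placeholders.

```python
import difflib
import subprocess


def tool_output(tool_args, binary):
    """Run an inspection tool (e.g. readelf, objdump) on a binary; return stdout lines."""
    result = subprocess.run(
        [*tool_args, binary], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


def diff_outputs(lines_a, lines_b, label_a="toolchain-a", label_b="toolchain-b"):
    """Return a unified diff (as a list of lines) of two lists of output lines."""
    return list(
        difflib.unified_diff(
            lines_a, lines_b, fromfile=label_a, tofile=label_b, lineterm=""
        )
    )


def compare_binaries(binary_a, binary_b):
    """Print diffs of ELF header, program headers, dynamic section, and disassembly."""
    for tool_args in (["readelf", "-h", "-l", "-d"], ["objdump", "-d"]):
        lines_a = tool_output(tool_args, binary_a)
        lines_b = tool_output(tool_args, binary_b)
        print("\n".join(diff_outputs(lines_a, lines_b, binary_a, binary_b)))
```

Calling `compare_binaries("hello-adelie", "hello-nixpkgs")` (placeholder paths) prints both diffs. The `readelf` pass is where a difference in relro/now settings would be expected to show up, for example as a `GNU_RELRO` segment or a `BIND_NOW` flag present on one side only. The disassembly diff will be noisy wherever addresses differ.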

n.b. we don't run valgrind's test suite because a lot of it has hardcoded assumptions conflating endianness with ABI type (BE = ELFv1, EL = ELFv2) when that is incorrect. The tests that don't have those assumptions pass easily; I think we have around 70% passing on ppc64. Fixing it is very low on my priority list right now, but I'm hoping to fix up test suites in general in the autumn, and that might be one of them.
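To illustrate why endianness and ABI type are independent, here is a small sketch that reads both from an ELF header. It relies on the standard ELF64 header layout (`EI_DATA` at byte 5, `e_machine` at offset 18, `e_flags` at offset 48) and on the `EF_PPC64_ABI` mask from the binutils headers; the function name is hypothetical.

```python
import struct

EM_PPC64 = 21        # e_machine value for 64-bit PowerPC
EF_PPC64_ABI = 0x3   # low two bits of e_flags hold the ABI version


def ppc64_elf_info(header: bytes):
    """Return (endianness, abi_version) for a ppc64 ELF header.

    abi_version is 1 for ELFv1, 2 for ELFv2, and 0 if the file does not
    declare one.
    """
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if header[4] != 2:  # EI_CLASS: 2 means ELFCLASS64
        raise ValueError("not a 64-bit ELF file")
    if header[5] == 1:  # EI_DATA: 1 means little-endian
        endianness, prefix = "little", "<"
    elif header[5] == 2:  # EI_DATA: 2 means big-endian
        endianness, prefix = "big", ">"
    else:
        raise ValueError("unknown data encoding")
    (machine,) = struct.unpack_from(prefix + "H", header, 18)
    if machine != EM_PPC64:
        raise ValueError("not a ppc64 ELF file")
    (flags,) = struct.unpack_from(prefix + "I", header, 48)
    return endianness, flags & EF_PPC64_ABI
```

Passing the first 64 bytes of a binary, e.g. `ppc64_elf_info(open(path, "rb").read(64))`, gives the two properties separately, so a big-endian ELFv2 binary like the `./test` file shown earlier in this thread would be reported as big-endian with ABI version 2 rather than being assumed to be ELFv1.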

wegank added the 2.status: merge conflict label Sep 27, 2024
wegank added the 2.status: stale label Jan 2, 2025