Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

debug: replace RtlCaptureStackBackTrace (which was spuriously failing) with a new implementation which uses RtlVirtualUnwind instead #12740

Merged
merged 10 commits into from
Jan 4, 2023

Conversation

kcbanner
Copy link
Contributor

@kcbanner kcbanner commented Sep 4, 2022

This PR implements std.debug.walkStackWindows as a replacement for RtlCaptureStackBackTrace.

Context

As part of debugging #12703 I was investigating why stack traces weren't appearing, and realized a simple test program that just panicked would not always print the stack trace.

Note that I'm using zig2 here as I can't build stage3 due to #12703.

Example program:

fn bar() void {
    @panic("test");
}

fn foo() void {
    bar();
}

pub fn main() void {
    foo();
}
>zig2 build-exe panic.zig -target native-windows-msvc

>panic.exe
thread 24112 panic: test

>panic.exe
thread 6572 panic: test
c:\cygwin64\home\kcbanner\temp\zig_panic\panic.zig:2:5: 0x7ff7c6e31d7f in bar (panic.exe.obj)
    @panic("test");
    ^
c:\cygwin64\home\kcbanner\temp\zig_panic\panic.zig:6:8: 0x7ff7c6e315ce in foo (panic.exe.obj)
    bar();
       ^
c:\cygwin64\home\kcbanner\temp\zig_panic\panic.zig:10:8: 0x7ff7c6e3128e in main (panic.exe.obj)
    foo();
       ^
C:\cygwin64\home\kcbanner\kit\zig\lib\std\start.zig:345:41: 0x7ff7c6e31247 in WinStartup (panic.exe.obj)
    std.debug.maybeEnableSegfaultHandler();
                                        ^
???:?:?: 0x7ffac2397033 in ??? (???)
???:?:?: 0x7ffac2ce2650 in ??? (???)

>panic.exe
thread 3864 panic: test

>panic.exe
thread 19984 panic: test

The reason for this is that RtlCaptureStackBackTrace spuriously returns 0. I couldn't determine the root cause of the spurious failures, but it seemed that others had run into this (https://developercommunity.visualstudio.com/t/capturestackbacktrace-randomly-fails-after-initial/1383213) and had gotten around it by unwinding the stack via the RtlLookupFunctionEntry/RtlVirtualUnwind calls.

I referenced these docs for the unwind process:

Supporting changes:

  • added RtlCaptureContext, RtlLookupFunctionEntry, RtlVirtualUnwind and supporting types
  • fix alignment of CONTEXT structs to match winnt.h as required by RtlCaptureContext (fxsave instr)
    • If a CONTEXT was put onto the stack and happened to not be aligned to 16, then the fxsave instruction would cause an access violation
  • windows aarch64: fix __chkstk being defined twice if libc is not linked on msvc
    • I'm not familiar with this system, but I was hitting this compile error with -target aarch64-windows-msvc

Testing

I'm unsure how to test on windows aarch64. I don't have the hardware - and while the code compiles against -target aarch64-windows-msvc, I can't be sure it functions. It seems like qemu for windows doesn't support the same userspace emulation as linux for running aarch64 binaries.

When running zig2 build test -Dskip-release, I get failures such as:

compiler [51/1153] compile_error_in_inline_fn_call_fixed (stage2, aarch64-macos) [2/2] update [2... error.Unexpected NTSTATUS=0xc0000056
c:\cygwin64\home\kcbanner\kit\zig\lib\std\debug.zig:608:31: 0x7ff77d15ff6f in writeCurrentStackTraceWindows__anon_57536 (test.exe.obj)
    const n = walkStackWindows(addr_buf[0..]);
                              ^
c:\cygwin64\home\kcbanner\kit\zig\lib\std\debug.zig:552:45: 0x7ff77d15fe8d in writeCurrentStackTrace__anon_57535 (test.exe.obj)
        return writeCurrentStackTraceWindows(out_stream, debug_info, tty_config, start_addr);
                                            ^

However, these also seem present on master, just with the old stack trace function. This seems like #12420?

I've made my best efforts to follow the style/contributing guidelines and but please let me know if I've missed something as this is my first PR here. I'm also happy to defer this until I can figure out #12703 and properly run the tests.

Copy link
Member

@kubkon kubkon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for investigating and the PR @kcbanner - this is some great work! Before we commit to using kernel32 calls in std.debug however, might I suggest we first try re-implementing RtlCaptureStackBackTrace directly in Zig (as part of std.os.windows module)? I'm not saying this will necessarily also work but it might shed more light on where the spurious failures are happening and would give us the ability to tweak the code as we please. For reference, here's the implementation of RtlCaptureStackBackTrace and RtlWalkFrameChain in ReactOS which I recommend using as a guide.

lib/compiler_rt/stack_probe.zig Outdated Show resolved Hide resolved
@kubkon
Copy link
Member

kubkon commented Sep 6, 2022

When it comes to testing, currently we don't support running stage2 incremental linking tests on Windows, so you will need to pass -Dskip-stage2-tests to your test run. I'm in the process of finishing up enough of the in-house COFF/PE linker that soon this will not be needed so stay tuned!

@kcbanner
Copy link
Contributor Author

kcbanner commented Sep 7, 2022

Thanks for the feedback! I'll take a look at those ReactOS implementations

@andrewrk andrewrk force-pushed the windows_stack_trace_fixes branch 2 times, most recently from 45957af to 537c376 Compare September 18, 2022 19:43
lib/std/os/windows.zig Outdated Show resolved Hide resolved
lib/std/os/windows.zig Outdated Show resolved Hide resolved
@kcbanner
Copy link
Contributor Author

Ah, good catches - fixed. I'll work on the proposal we talked about for the COFF unwind info once this is merged 👍

@kcbanner
Copy link
Contributor Author

I believe something has regressed here with LLVM 15. I was testing this against latest master and the 64 bit path is broken both on master (old method) and with these changes. On this branch, RtlLookupFunctionEntry returns null.

Using a zig re-implementation of RtlVirtualUnwind that I've been working on, I can see that the coff.DirectoryEntry.EXCEPTION directory has a size of zero. This directory in the COFF is supposed to contain entries which have the start/end address of each function, so that you can map an IP back to a function (this is what RtlLookupFunctionEntry does).

My guess is something changed with the LLVM 15 update that caused the linker not to include this directory anymore (and this is a debug build). In LLV14, this section was present.

Regardless, this PR isn't ready to merge until I figure out why this directory isn't there anymore.

@kcbanner
Copy link
Contributor Author

This is confirmed by dumpbin:

The LLVM15 compiled version has a missing .pdata section (which corresponds to the exception directory entry, which is also 0).

exe compiled in debug mode with with zig that used LLVM 14.0.6:

Dump of file comments.exe

PE signature found

File Type: EXECUTABLE IMAGE

FILE HEADER VALUES
            8664 machine (x64)
               7 number of sections
        630B005A time date stamp Sun Aug 28 01:42:50 2022
               0 file pointer to symbol table
               0 number of symbols
              F0 size of optional header
              22 characteristics
                   Executable
                   Application can handle large (>2GB) addresses

OPTIONAL HEADER VALUES
             20B magic # (PE32+)
           14.00 linker version
           72C00 size of code
           44200 size of initialized data
               0 size of uninitialized data
            1230 entry point (0000000140001230) wWinMainCRTStartup
            1000 base of code
       140000000 image base (0000000140000000 to 00000001400BDFFF)
            1000 section alignment
             200 file alignment
            6.00 operating system version
            0.00 image version
            6.00 subsystem version
               0 Win32 version
           BE000 size of image
             400 size of headers
               0 checksum
               3 subsystem (Windows CUI)
            8160 DLL characteristics
                   High Entropy Virtual Addresses
                   Dynamic base
                   NX compatible
                   Terminal Server Aware
         1000000 size of stack reserve
            1000 size of stack commit
          100000 size of heap reserve
            1000 size of heap commit
               0 loader flags
              10 number of directories
               0 [       0] RVA [size] of Export Directory
           B2598 [      3C] RVA [size] of Import Directory
               0 [       0] RVA [size] of Resource Directory
           B8000 [    2610] RVA [size] of Exception Directory
               0 [       0] RVA [size] of Certificates Directory
           BD000 [     170] RVA [size] of Base Relocation Directory
           B2538 [      1C] RVA [size] of Debug Directory
               0 [       0] RVA [size] of Architecture Directory
               0 [       0] RVA [size] of Global Pointer Directory
           B2510 [      28] RVA [size] of Thread Storage Directory
               0 [       0] RVA [size] of Load Configuration Directory
               0 [       0] RVA [size] of Bound Import Directory
           B26E0 [     108] RVA [size] of Import Address Table Directory
               0 [       0] RVA [size] of Delay Import Directory
               0 [       0] RVA [size] of COM Descriptor Directory
               0 [       0] RVA [size] of Reserved Directory

exe compiled with latest zig master, LLVM 15.0.0:

Dump of file comments_llvm15.exe

PE signature found

File Type: EXECUTABLE IMAGE

FILE HEADER VALUES
            8664 machine (x64)
               6 number of sections
        632FCC2B time date stamp Sat Sep 24 23:34:03 2022
               0 file pointer to symbol table
               0 number of symbols
              F0 size of optional header
              22 characteristics
                   Executable
                   Application can handle large (>2GB) addresses

OPTIONAL HEADER VALUES
             20B magic # (PE32+)
           14.00 linker version
           73800 size of code
           43600 size of initialized data
               0 size of uninitialized data
            1280 entry point (0000000140001280) wWinMainCRTStartup
            1000 base of code
       140000000 image base (0000000140000000 to 00000001400BCFFF)
            1000 section alignment
             200 file alignment
            6.00 operating system version
            0.00 image version
            6.00 subsystem version
               0 Win32 version
           BD000 size of image
             400 size of headers
               0 checksum
               3 subsystem (Windows CUI)
            8160 DLL characteristics
                   High Entropy Virtual Addresses
                   Dynamic base
                   NX compatible
                   Terminal Server Aware
         1000000 size of stack reserve
            1000 size of stack commit
          100000 size of heap reserve
            1000 size of heap commit
               0 loader flags
              10 number of directories
               0 [       0] RVA [size] of Export Directory
           B774F [      3C] RVA [size] of Import Directory
               0 [       0] RVA [size] of Resource Directory
               0 [       0] RVA [size] of Exception Directory
               0 [       0] RVA [size] of Certificates Directory
           BC000 [     18C] RVA [size] of Base Relocation Directory
           B76E8 [      1C] RVA [size] of Debug Directory
               0 [       0] RVA [size] of Architecture Directory
               0 [       0] RVA [size] of Global Pointer Directory
           B76C0 [      28] RVA [size] of Thread Storage Directory
               0 [       0] RVA [size] of Load Configuration Directory
               0 [       0] RVA [size] of Bound Import Directory
           B78A8 [     118] RVA [size] of Import Address Table Directory
               0 [       0] RVA [size] of Delay Import Directory
               0 [       0] RVA [size] of COM Descriptor Directory
               0 [       0] RVA [size] of Reserved Directory

@kcbanner
Copy link
Contributor Author

The regression above is fixed with #12959

@kubkon
Copy link
Member

kubkon commented Oct 23, 2022

First of, I'd like to apologise for taking so long to take another look at your PR! In general, I believe we are ready to merge except the merge conflicts, and so, would you mind rebasing on latest master so that we can follow up with a merge?

@kcbanner
Copy link
Contributor Author

No problem, I'll get this updated soon 👍

@kcbanner
Copy link
Contributor Author

Ok, rebased onto master 👍

Copy link
Member

@andrewrk andrewrk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking this on.

Is this better than status quo? I get that it's a workaround that seems to have worked for somebody on the Internet, but what the hell is going on here? I think we need to have a well-documented justification for this. It seems like this has not been investigated thoroughly to me.

If RtlVirtualUnwind is a lower level call, then this starts to makes sense to me, especially if RtlCaptureStackBackTrace is internally using it in its implementation. Perhaps we could look at ReactOS source code to help determine this.

The main requested changes I have here is regarding #1840. This PR is taking things in the wrong direction.

@kcbanner
Copy link
Contributor Author

kcbanner commented Oct 30, 2022

Personally I feel like it is an improvement, because it did solve my own issue, which was that I wasn't getting stack traces reliably. I understand the syscall may be a deal breaker though.

There is some more context in this thread around why I opted to use the kernel32 call: https://discord.com/channels/605571803288698900/1017808096682311740/1018889803128918056

Essentially, I did write an implementation of RtlVirtualUnwind in zig, that parses all the unwind table opcodes from the unwind info to figure out how to walk back up the stack. I referenced the ReactOS implementation as part of that. I managed to implement that completely without any syscalls, except for a critical section call to lock around the list of loaded modules from the PEB.

The implementation (WIP: working on x64, but not cleaned up): https://gist.github.com/kcbanner/61058fba639fca1a6bda13773489a484

So while that does work, you still have to make the kernel32 call, otherwise you risk reading the list while it's being changed. Other than that, it can walk the stack completely without any syscalls. If we are ok with just not locking that critical section, then this could be done entirely in zig without syscalls. There would be potential for crashes if the module list is modified while it's being iterated though. With that in mind (and the use of a bunch of undocumented data structures), I stopped with the custom implementation since we needed a kernel32 call anyway.

I'm happy to explore that further, but we'd still need that kernel32 call to grab the lock. If we ignore the lock then we can't safely iterate that linked list, and if we cache results then there is a TOCTOU issue.

From what I could tell the ReactOS version of RtlCaptureStackBackTrace uses RtlWalkFrameChain (https://doxygen.reactos.org/dd/df8/dll_2ntdll_2rtl_2libsupp_8c.html#a703c006cc016ffffe4d2982c373711d4), which just uses the frame pointer to walk the state. So, that works on x86 and on x64 if there was no extra stack space allocated. However, if there was - then you need to use the unwind info to figure out what the prologue did in order to reverse it.

I think the Microsoft impl is probably different than this, since it did work sometimes. So I could run through the disasm of that to see what's actually going on in order to debug the original spurious failure - but I didn't do that investigation yet.

@andrewrk
Copy link
Member

Thank you for this writeup. I appreciate your patience with me here.

In NtDll.dll, I see the following functions:

  • RtlCaptureContext
  • RtlCaptureContext2
  • RtlLookupFunctionEntry
  • RtlVirtualUnwind
  • RtlVirtualUnwind2
  • RtlEnterCriticalSection
  • RtlLeaveCriticalSection

more

After reading your writeup, and making note of these functions, I'm in favor of your changes, but I still want to emphasize #1840. Our goal is to transition to NtDll being the only dependency of the standard library. It seems compatible with your changes. Does that sound reasonable to you?

@kcbanner
Copy link
Contributor Author

kcbanner commented Oct 30, 2022

No problem, and yes that make sense to me!

I actually had misunderstood and didn't realize kernel32 wrapped ntdll, I thought they were independent, since the MSDN docs always refer to kernel32 for things like RtlVirtualUnwind. Reading #1840 now everything makes more sense.

@kcbanner kcbanner force-pushed the windows_stack_trace_fixes branch 2 times, most recently from 5006c80 to 874d128 Compare November 7, 2022 05:09
@kcbanner
Copy link
Contributor Author

Updated to call via ntdll instead of kernel32.

Should I keep the definitions of these functions in kernel32.zig for users to call, or I remove them now that they exist in ntdll.zig as well?

…) with a new implementation which uses RtlVirtualUnwind instead

windows: add RtlCaptureContext, RtlLookupFunctionEntry, RtlVirtualUnwind and supporting types
windows: fix alignment of CONTEXT structs to match winnt.h as required by RtlCaptureContext (fxsave instr)
windows aarch64: fix __chkstk being defined twice if libc is not linked on msvc
@andrewrk
Copy link
Member

andrewrk commented Jan 4, 2023

Should I keep the definitions of these functions in kernel32.zig for users to call, or I remove them now that they exist in ntdll.zig as well?

This question can be left for #4426 to solve

@andrewrk andrewrk merged commit 09ff03a into ziglang:master Jan 4, 2023
@andrewrk
Copy link
Member

andrewrk commented Jan 4, 2023

Great work! Thanks for the follow ups.

@kcbanner kcbanner deleted the windows_stack_trace_fixes branch January 8, 2023 22:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants