-
-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
debug: replace RtlCaptureStackBackTrace (which was spuriously failing) with a new implementation which uses RtlVirtualUnwind instead #12740
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for investigating and the PR @kcbanner - this is some great work! Before we commit to using kernel32
calls in std.debug
however, might I suggest we first try re-implementing RtlCaptureStackBackTrace
directly in Zig (as part of std.os.windows
module)? I'm not saying this will necessarily also work but it might shed more light on where the spurious failures are happening and would give us the ability to tweak the code as we please. For reference, here's the implementation of RtlCaptureStackBackTrace
and RtlWalkFrameChain
in ReactOS which I recommend using as a guide.
When it comes to testing, currently we don't support running stage2 incremental linking tests on Windows, so you will need to pass |
Thanks for the feedback! I'll take a look at those ReactOS implementations |
45957af
to
537c376
Compare
Ah, good catches - fixed. I'll work on the proposal we talked about for the COFF unwind info once this is merged 👍 |
I believe something has regressed here with LLVM 15. I was testing this against latest master and the 64 bit path is broken both on master (old method) and with these changes. On this branch, Using a zig re-implementation of My guess is something changed with the LLVM 15 update that caused the linker not to include this directory anymore (and this is a debug build). In LLV14, this section was present. Regardless, this PR isn't ready to merge until I figure out why this directory isn't there anymore. |
This is confirmed by The LLVM15 compiled version has a missing .pdata section (which corresponds to the exception directory entry, which is also 0). exe compiled in debug mode with with zig that used LLVM 14.0.6:
exe compiled with latest zig master, LLVM 15.0.0:
|
The regression above is fixed with #12959 |
First of, I'd like to apologise for taking so long to take another look at your PR! In general, I believe we are ready to merge except the merge conflicts, and so, would you mind rebasing on latest master so that we can follow up with a merge? |
No problem, I'll get this updated soon 👍 |
3daf0eb
to
8a3f422
Compare
Ok, rebased onto master 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for taking this on.
Is this better than status quo? I get that it's a workaround that seems to have worked for somebody on the Internet, but what the hell is going on here? I think we need to have a well-documented justification for this. It seems like this has not been investigated thoroughly to me.
If RtlVirtualUnwind
is a lower level call, then this starts to makes sense to me, especially if RtlCaptureStackBackTrace
is internally using it in its implementation. Perhaps we could look at ReactOS source code to help determine this.
The main requested changes I have here is regarding #1840. This PR is taking things in the wrong direction.
Personally I feel like it is an improvement, because it did solve my own issue, which was that I wasn't getting stack traces reliably. I understand the syscall may be a deal breaker though. There is some more context in this thread around why I opted to use the kernel32 call: https://discord.com/channels/605571803288698900/1017808096682311740/1018889803128918056 Essentially, I did write an implementation of The implementation (WIP: working on x64, but not cleaned up): https://gist.github.com/kcbanner/61058fba639fca1a6bda13773489a484 So while that does work, you still have to make the kernel32 call, otherwise you risk reading the list while it's being changed. Other than that, it can walk the stack completely without any syscalls. If we are ok with just not locking that critical section, then this could be done entirely in zig without syscalls. There would be potential for crashes if the module list is modified while it's being iterated though. With that in mind (and the use of a bunch of undocumented data structures), I stopped with the custom implementation since we needed a kernel32 call anyway. I'm happy to explore that further, but we'd still need that kernel32 call to grab the lock. If we ignore the lock then we can't safely iterate that linked list, and if we cache results then there is a TOCTOU issue. From what I could tell the ReactOS version of I think the Microsoft impl is probably different than this, since it did work sometimes. So I could run through the disasm of that to see what's actually going on in order to debug the original spurious failure - but I didn't do that investigation yet. |
Thank you for this writeup. I appreciate your patience with me here. In NtDll.dll, I see the following functions:
After reading your writeup, and making note of these functions, I'm in favor of your changes, but I still want to emphasize #1840. Our goal is to transition to NtDll being the only dependency of the standard library. It seems compatible with your changes. Does that sound reasonable to you? |
No problem, and yes that make sense to me! I actually had misunderstood and didn't realize kernel32 wrapped ntdll, I thought they were independent, since the MSDN docs always refer to kernel32 for things like RtlVirtualUnwind. Reading #1840 now everything makes more sense. |
5006c80
to
874d128
Compare
874d128
to
26bdb19
Compare
Updated to call via ntdll instead of kernel32. Should I keep the definitions of these functions in kernel32.zig for users to call, or I remove them now that they exist in ntdll.zig as well? |
5f410cc
to
3c2f980
Compare
3c2f980
to
a920b30
Compare
…) with a new implementation which uses RtlVirtualUnwind instead windows: add RtlCaptureContext, RtlLookupFunctionEntry, RtlVirtualUnwind and supporting types windows: fix alignment of CONTEXT structs to match winnt.h as required by RtlCaptureContext (fxsave instr) windows aarch64: fix __chkstk being defined twice if libc is not linked on msvc
…ON_REGISTRATION_RECORD
a920b30
to
ede243a
Compare
This question can be left for #4426 to solve |
Great work! Thanks for the follow ups. |
This PR implements
std.debug.walkStackWindows
as a replacement forRtlCaptureStackBackTrace
.Context
As part of debugging #12703 I was investigating why stack traces weren't appearing, and realized a simple test program that just panicked would not always print the stack trace.
Note that I'm using
zig2
here as I can't build stage3 due to #12703.Example program:
The reason for this is that
RtlCaptureStackBackTrace
spuriously returns 0. I couldn't determine the root cause of the spurious failures, but it seemed that others had run into this (https://developercommunity.visualstudio.com/t/capturestackbacktrace-randomly-fails-after-initial/1383213) and had gotten around it by unwinding the stack via theRtlLookupFunctionEntry
/RtlVirtualUnwind
calls.I referenced these docs for the unwind process:
Supporting changes:
RtlCaptureContext
,RtlLookupFunctionEntry
,RtlVirtualUnwind
and supporting typesCONTEXT
structs to match winnt.h as required byRtlCaptureContext
(fxsave
instr)CONTEXT
was put onto the stack and happened to not be aligned to 16, then thefxsave
instruction would cause an access violation-target aarch64-windows-msvc
Testing
I'm unsure how to test on windows aarch64. I don't have the hardware - and while the code compiles against
-target aarch64-windows-msvc
, I can't be sure it functions. It seems like qemu for windows doesn't support the same userspace emulation as linux for running aarch64 binaries.When running
zig2 build test -Dskip-release
, I get failures such as:However, these also seem present on master, just with the old stack trace function. This seems like #12420?
I've made my best efforts to follow the style/contributing guidelines and but please let me know if I've missed something as this is my first PR here. I'm also happy to defer this until I can figure out #12703 and properly run the tests.