A self-built crossgen2 hangs on some recent Linux distributions #61671
Comments
Maybe we can opt-out of crossgen until we've figured out the problem?
I'd start by attaching a debugger and get a stacktrace. I'll try to reproduce it. |
At I'm ignoring it and I'm now at |
Sorry, that should have been It just pulls down all dependencies to build .NET, which haven't really changed between .NET 5 and 6. If you already have the build dependencies, this command doesn't do anything new. |
The managed threads seem to be stuck trying to obtain the same lock:
Native stacktrace:
|
@mangod9 can you take a look at this issue? |
@tmds said:
That's a great idea! I tried it out and the build completes after disabling crossgen2 in both aspnetcore and
|
Yeah, this doesn't seem like a crossgen2-specific issue; it indicates a runtime/libraries issue. Do the threads always block with |
Retrying with a simple
I don't see a concurrent dictionary object in the stack trace. |
Passing along a comment from https://pagure.io/dotnet-sig/dotnet6.0/issue/1#comment-762682, it looks like compiling with clang 12 makes things work. It's only under clang 13 that things appear to be completely broken. |
Since the issue isn't source-build specific, I tried building runtime repo and run library tests in a |
Hi @tmds, was your test with a build using clang 13 as well? |
@mangod9 yes, only that version is available on Fedora 35. |
For example:
lldb stacktrace:
The |
The generated interop code appears correct. @tmds Can you confirm the output in
[System.Runtime.CompilerServices.SkipLocalsInitAttribute]
internal static partial int FStat(global::System.Runtime.InteropServices.SafeHandle fd, out global::Demo.FileStatus output)
{
    System.IntPtr __fd_gen_native = default;
    output = default;
    int __retVal;
    int __lastError;
    //
    // Setup
    //
    bool fd__addRefd = false;
    try
    {
        //
        // Marshal
        //
        fd.DangerousAddRef(ref fd__addRefd);
        __fd_gen_native = fd.DangerousGetHandle();
        fixed (global::Demo.FileStatus* __output_gen_native = &output)
        {
            {
                System.Runtime.InteropServices.Marshal.SetLastSystemError(0);
                __retVal = __PInvoke__(__fd_gen_native, __output_gen_native);
                __lastError = System.Runtime.InteropServices.Marshal.GetLastSystemError();
            }
        }
    }
    finally
    {
        //
        // Cleanup
        //
        if (fd__addRefd)
            fd.DangerousRelease();
    }
    System.Runtime.InteropServices.Marshal.SetLastPInvokeError(__lastError);
    return __retVal;
    //
    // Local P/Invoke
    //
    [System.Runtime.InteropServices.DllImportAttribute(Libraries.SystemNative, EntryPoint = "SystemNative_FStat")]
    extern static unsafe int __PInvoke__(System.IntPtr fd, global::Demo.FileStatus* output);
} |
I'll check whether it's the same. I should be working with the 6.0 branch which is what this issue is reported against, and doesn't use
This means the bug is related to code that gets built by clang. I just tried running runtime tests and they immediately fail:
|
I just tried out Seeing same errors as @tmds using Checked builds. |
@janvorli as fyi. |
The library tests don't SIGABRT either when using a |
Using release/6.0 branch. With The Hello World console program asserts:
The assert that unexpectedly fails is:
Commenting out the
|
@janvorli can you take a look at this issue? The previous comment describes how you can reproduce it on Fedora 35. |
@tmds I'll take a look. I wonder though if it could be related to the issue we were hitting on some Ubuntu versions due to a bug in the glibc lock implementation. Here is a glibc tracking issue: https://sourceware.org/bugzilla/show_bug.cgi?id=25847. There is even a Red Hat tracking issue for that problem: https://bugzilla.redhat.com/show_bug.cgi?id=1889892. Glibc >= 2.27 is affected by the problem and, the last time I checked, the patch suggested in the glibc tracking issue had not been merged into glibc yet. Recent versions of Ubuntu apply that patch. |
This is the dotnet issue that was tracking the problem on Ubuntu: #47700 |
I doubt that it is same issue. Based on the discussion, this issue is triggered by compiling the runtime with clang 13, and it leads to crashes or deadlocks. The glibc issue always leads to deadlock and it is very hard to repro. |
The stacktraces from For the stacktraces of the other applications that was no longer the case. In all cases there seems to be state that is used with volatile/CAS operations, which has some unexpected values. |
To test this out, I rebuilt glibc-2.34 on Fedora 35 with https://sourceware.org/bugzilla/attachment.cgi?id=12484&action=diff&collapsed=&headers=1&format=raw applied. Compiling with clang 13 and the self-built glibc, I am still seeing the original issue. |
Makes sense. |
Clang is correct. Placing an object of that type at a misaligned address would have undefined behaviour. Clang assumes you didn't do that. |
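To make the alignment point concrete, here is a minimal sketch (not taken from the runtime sources) of reading a structure from a byte buffer at an offset that may not be suitably aligned. The names `Header` and `read_header` are hypothetical and only serve the illustration. Casting `buffer + offset` to `Header*` and dereferencing it would let the compiler assume 4-byte alignment; copying the bytes with `std::memcpy` avoids ever forming a misaligned `Header*`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical structure with a 4-byte alignment requirement, standing in
// for any type whose alignment is greater than 1.
struct Header {
    std::uint32_t kind;
    std::uint32_t length;
};

// Reads a Header from an arbitrary byte offset. The bytes are copied into
// a properly aligned local object, so no misaligned Header* is ever
// created or dereferenced.
inline Header read_header(const unsigned char* buffer, std::size_t offset) {
    assert(buffer != nullptr);
    Header h;
    std::memcpy(&h, buffer + offset, sizeof(Header));
    return h;
}
```

Compilers typically turn the `memcpy` into a plain load on targets that tolerate unaligned access, so this form usually costs nothing extra.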
I see this comment on runtime/src/coreclr/inc/corhlpr.h Lines 322 to 324 in 2307418
Is the comment out of date, or is some other code violating the assumptions of |
@omajid the comment is correct and the structure is aligned in the data. The problem is in the way we are computing the aligned position. |
@omajid, I have no experience with the fedora packager. What is the easiest way I can use to inject my fix so that I can test it with the fedpkg? |
@janvorli Try this on a Fedora 35 machine/vm/container:
git clone https://pagure.io/dotnet-sig/dotnet6.0.git
cd dotnet6.0
git checkout clang13-hack
# replace runtime-pragma-pack.patch with the actual fix (generated via `git diff` or `git format-patch`, etc)
sudo dnf build-dep dotnet5.0 -y
./build-dotnet-tarball --bootstrap 9e8b04bbff820c93c142f99a507a46b976f5c14c
# Wait about 10 minutes
fedpkg --release f35 local
# Wait about an hour
Edit: if that works, can you try using the just-built source-built to compile itself? It should look like this
Alternatively, try https://github.com/dotnet/source-build#building, but before step 4 (running |
This was the clang change that changed the behavior: llvm/llvm-project@0aa0458 |
@omajid the source build passed fine. I've then tried to use the just built stuff using the second set of steps you've asked me to try, but the
I can only see the following tarballs in the directory:
|
Sorry, I forgot this step before running the second build:
That should produce a |
I've tried that and the build failed, maybe there is yet another issue there: |
I've tried this and the build passed with a fix. I've made a slightly different fix, changing the Align method so that it works on a BYTE* before it is cast to the I still want to try to figure out the other issue causing the null reference exception in the build and in one of the tests. |
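As an illustration of the idea of aligning on a byte pointer first, here is a rough sketch. It is not the actual fix and not the runtime's Align method; `align_up` is a hypothetical helper. The rounding is done on the integer value of the address, so a pointer to an over-aligned type only needs to be formed once the address is known to be aligned.

```cpp
#include <cassert>
#include <cstdint>

// Rounds a byte pointer up to the next multiple of `alignment`, which must
// be a power of two. Only the byte pointer and its integer address are
// used, so the compiler has no stricter alignment to assume.
inline const unsigned char* align_up(const unsigned char* p,
                                     std::uintptr_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    const std::uintptr_t aligned = (addr + alignment - 1) & ~(alignment - 1);
    // Advance the original pointer by the padding that was needed.
    return p + (aligned - addr);
}
```

A caller would cast the returned pointer to the structure type only after this step. If the cast happens first, the compiler may treat the low address bits as already zero and optimize the rounding away.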
@janvorli can you already make a PR with this fix? |
@tmds I am still investigating the NullReferenceException I was getting. It reproduces for the clang 13 build in one of the coreclr tests quite often and I was getting one using the original repro steps, so I wanted to make a PR that fixes all known issues. But if you'd prefer to get the fix I already have as a separate PR, I'd be fine with it. |
I am almost convinced now that the issue with NullReferenceException is also due to the clang 13 compilation. I was able to find a way to get a 100% repro in the CscBench coreclr test by disabling tiered jitting ( |
I have some additional details. The NullReferenceException problem is actually in the JIT shared library instead. If I use the binaries built by an older clang and replace just the libclrjit.so by the one built by clang 13, the issue starts happening too. |
Compiling with -fsanitize=alignment and running the tests may help catch other similar problems. |
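For places where running the sanitizer is not practical, a manual check along the same lines can be sketched as follows. This is only an illustration: `is_aligned_for` is a hypothetical helper, not something in the runtime, and it covers only the spots where it is called explicitly, whereas the sanitizer instruments accesses automatically.

```cpp
#include <cstdint>

// Returns true when `p` satisfies the alignment requirement of T, i.e.
// when forming and dereferencing a T* from it would not be a misaligned
// access.
template <typename T>
inline bool is_aligned_for(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0;
}
```

Such a check could sit in a debug-only assert right before a cast from a byte pointer to a structure pointer.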
I tried enabling sanitizers but ran into build errors in CoreCLR: #61948 (comment) |
I can wait for the full fix. I'm compiling runtime with Debug configuration and that works for me. |
With great help from @jakobbotsch, the NullReferenceException culprit was discovered. I'll create a PR fixing those two issues in a minute. |
Thanks a lot @janvorli and @jakobbotsch! |
Description
I don't fully understand everything, but wanted to file this bug to raise awareness and get tips on how to narrow down the issue.
I am using source-build on Fedora 35 to build all of .NET 6. In this environment, the source-built crossgen2 hangs when trying to build parts of ASP.NET Core.
ps shows that this command has been running since last evening, without any progress:
sha256sum lets me confirm that this is the same bit-by-bit identical crossgen2 built from the runtime repo (and not one fetched from a nuget package):
As a point of comparison, this works fine on Fedora 34.
Relevant versions of packages that I am using:
Also tested with the final (non-rc) clang packages:
clang-13.0.0-3.fc35.x86_64 llvm-13.0.0-4.fc35.x86_64
and the result is the same.
@alucryd suggested this might be a clang 13 issue: dotnet/source-build#2602 (comment)
This was also observed by another user on Fedora 35: https://pagure.io/dotnet-sig/dotnet6.0/issue/1
Reproduction Steps
On a Fedora 35 machine:
Edit: I updated these instructions to use commit abafa176a7ac41bc6b2ebf84040bd39bca21c15a since later versions will disable crossgen completely to try and work around this issue.
Expected behavior
Build (more specifically, crossgen2) works to completion
Actual behavior
crossgen2 hangs
Regression?
This was working on Fedora 34. It's a regression somewhere in the software stack, but I am not sure where the root cause lies.
Known Workarounds
None that I know of, at least on Fedora 35.
Configuration
Other information
No response