[NativeAOT] Object synchronization method was called from an unsynchronized block of code. #104340
Hit in #104202. Log https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-104202-merge-fe38091c4b294deca8/System.Threading.Channels.Tests/1/console.5ad789fe.log?helixlogtype=result :
Hit in #104332. Log https://dev.azure.com/dnceng-public/public/_build/results?buildId=728548&view=logs&j=eb3d96e8-a17f-53b1-ecea-be64723f4b14&t=d8e271dd-8628-5dad-a374-176afa32935b&l=673 :
I think the failure in ILC.exe started happening after #103574 (this is ILC-compiled ILC, using the LKG ILC), so that narrows it down to like two weeks' worth of commits -_-. ILC.exe failures don't have crash dumps because of infra limitations, and neither does the libraries test failure (again because of infra design).
I spent some time digging through pipeline logs and trying to repro locally. We've not seen this in libraries testing in the past 10+ days, only when running ILC. We picked up a new ILC on Wednesday and it hasn't reproed since. I was not able to get a local repro either (though I was already running with the updated ILC). So fingers crossed that whatever this was, it was already fixed and we just need to wait a couple more days to get enough confidence to close this issue.
Unfortunately, it seems this has not yet been fixed. Here's a fresh hit from today:
I also just hit it (linux-x64 Debug NativeAOT)
A lot of these failures are coming from ResourceReader, which hasn't changed in a couple of years; that might just be a common use of locking. It seems more likely that the problem is caused by something related to threading, so I'm changing the area to System.Threading for now.
@agocke, @MichalStrehovsky: from the stack trace it appears that this is coming from the NativeAOT-specific implementation of ObjectHeader, correct? Looking at the source here: runtime/src/coreclr/nativeaot/System.Private.CoreLib/src/System/Threading/ObjectHeader.cs Line 471 in 280f2a0
There is one hit while running native AOT smoke tests:
But the rest are in the compiler with ResourceReader. I was not able to get a repro locally either, by running the compiler in a loop or by writing a targeted test that works similarly to the use of lock in ResourceReader. We create a ResourceReader that is local to the thread and never escapes, so I'm not sure how someone else could get hold of the instance and lock on it. Unfortunately I don't know much about how locking is implemented, so I can't tell whether we could instrument something to make it more likely to hit. We also don't have a dump for any of these.
Doesn't look like it's the case where another thread is holding it; more likely it's not able to find the relevant SyncEntryIndex from the ObjHeader. But without a dump I can't say. Perhaps you can add some logging to check what the ObjHeader state is when it throws?
This kind of failure could happen if we try to release a thin lock that we did not acquire: either the lock is not acquired by anyone, or it is acquired by some other thread. The point is that the lock is not inflated into a full Lock. If it was, then we would get the fat Lock and try releasing that, and if that failed it would throw from a different location. The lock usage is typically correctly paired; most uses would just use the lock statement.

It is also possible that the lock was in fact inflated, but we could not retrieve the fat lock because the index of the corresponding fat lock got corrupted. This is less likely, though; a corrupted index would most likely result in some IndexOutOfRange failure.

So the working theory would be that someone corrupted the syncblock and messed up the owner thread ID part.
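To make that failure mode concrete, here is a rough sketch of what releasing a thin lock might check. The bit layout and all names here (`OwnerIdMask`, `Release`) are invented for illustration and do not match the actual ObjectHeader.cs code:

```csharp
using System.Threading;

static class ThinLockSketch
{
    // Hypothetical layout: low bits of the sync block word hold the owner thread id.
    const int OwnerIdMask = 0xFFFF;

    internal static void Release(ref int syncBlockWord, int currentThinLockId)
    {
        // If the owner-id bits do not match the current thread (for example because
        // the header word was corrupted), a correctly paired Exit looks unbalanced:
        if ((syncBlockWord & OwnerIdMask) != currentThinLockId)
        {
            // "Object synchronization method was called from an unsynchronized block of code."
            throw new SynchronizationLockException();
        }

        // Otherwise clear the owner bits to release the lock.
        Interlocked.And(ref syncBlockWord, ~OwnerIdMask);
    }
}
```

This is only meant to show why corrupted owner-thread-ID bits surface as a SynchronizationLockException at release time even when the lock usage is correctly paired.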
Some of the failures come from Wasm test runs. Is NativeAOT involved in that? Some tools are built with NativeAOT perhaps? |
I'm pretty sure they are unrelated; the infra insists on pairing that with this issue even though there is no ObjectHeader.Release in the stack trace.
I was so far unable to reproduce the failure, even after running the repro for hundreds of iterations.

From what we see, this is not a platform-specific problem; we see it both on Linux and on Windows. It is also not a general problem with thin locks or with GC, or the failures would be more widespread. These failures appear only in two very similar methods. In both cases the methods take … So the object header must be bad coming back from the deserializing methods, but only very rarely. The methods do engage in some unsafe (as in …) …

(Although: why do we see this on Debug, and why not on CoreCLR? Maybe it also happens there, or maybe there are some mitigating factors.)

Looking at potential overruns and/or adding some asserts around questionable paths could be the next step in figuring this out. This is what I know so far about this bug.
adding @kouvel as well, in case there are any thoughts. |
I think I figured out the root cause of the failure. It is an unintended side effect of #92974.

When we update object header bits we do a CompareExchange. One interesting thing to notice, though, is that we pass the location by reference. #92974 made the CompareExchange self-referential, relying on second-level intrinsic expansion: https://github.com/dotnet/runtime/pull/92974/files#r1720791671

Notice that in Debug the expansion is forced in the implementation of CompareExchange, but the application code (in our case ILC) would still make calls to the actual method:

```csharp
[Intrinsic]
public static int CompareExchange(ref int location1, int value, int comparand)
{
#if TARGET_X86 || TARGET_AMD64 || TARGET_ARM64 || TARGET_RISCV64
    return CompareExchange(ref location1, value, comparand); // Must expand intrinsic
```

Here is the Debug codegen of that method; note the window where RCX holds a byref to an object header:

```
00007FF7E4DA8760  55              push    rbp
00007FF7E4DA8761  57              push    rdi
00007FF7E4DA8762  50              push    rax
00007FF7E4DA8763  48 8D 6C 24 10  lea     rbp,[rsp+10h]
00007FF7E4DA8768  48 89 4D 10     mov     qword ptr [rbp+10h],rcx
00007FF7E4DA876C  89 55 18        mov     dword ptr [rbp+18h],edx
00007FF7E4DA876F  44 89 45 20     mov     dword ptr [rbp+20h],r8d
00007FF7E4DA8773  48 8B 4D 10     mov     rcx,qword ptr [location1]
00007FF7E4DA8777  8B 55 18        mov     edx,dword ptr [value]
// when called from ObjectHeader.cs, a GC here can cause corruption
// of syncblock bits (or crash) if GC is compacting, since RCX contains a byref to an object header
00007FF7E4DA877A  8B 45 20        mov     eax,dword ptr [comparand]
00007FF7E4DA877D  F0 0F B1 11     lock cmpxchg dword ptr [rcx],edx
00007FF7E4DA8781  89 45 F4        mov     dword ptr [rbp-0Ch],eax
00007FF7E4DA8784  8B 45 F4        mov     eax,dword ptr [rbp-0Ch]
00007FF7E4DA8787  48 83 C4 08     add     rsp,8
00007FF7E4DA878B  5F              pop     rdi
00007FF7E4DA878C  5D              pop     rbp
00007FF7E4DA878D  C3              ret
```
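For context, the pattern in ObjectHeader.cs is roughly a CAS retry loop over the sync block word. This is a simplified sketch, not the actual runtime code; the method and parameter names are invented:

```csharp
using System.Threading;

static class ObjectHeaderSketch
{
    // Set a bit in the sync block word, retrying on contention. The crucial
    // detail is that `syncBlockWord` is a byref into the object: if a GC moves
    // the object, the runtime must update the byref, and that only works if
    // the byref is correctly tracked and interpreted at every GC-interruptible
    // point, including inside CompareExchange itself.
    internal static void SetBit(ref int syncBlockWord, int bit)
    {
        while (true)
        {
            int oldBits = Volatile.Read(ref syncBlockWord);
            int newBits = oldBits | bit;
            if (Interlocked.CompareExchange(ref syncBlockWord, newBits, oldBits) == oldBits)
                return;
        }
    }
}
```

When CompareExchange expands to an intrinsic at the call site, there is no GC-interruptible window while the raw address is in a register; the non-expanded Debug call shown above breaks that assumption.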
I think the most obvious fix for this would be introducing a pointer-taking CompareExchange overload, but we are probably too close to shipping to do that. Also, such a helper would only be required for rare tricks, probably reserved to the runtime itself, like in the case of ObjectHeader.cs. I plan to make a bit more scoped fix for 9.0 that would basically be:
Great catch!
I think it is still fine to introduce this intrinsic. It should be a straightforward change in the JIT. Depending on subtle details of when the code may or may not be interruptible is very fragile; it is likely going to have a bug tail.
Note that this was found by running ILC with additional stress: basically an extra thread that periodically calls GC.Collect(0) and does some allocations (I wonder if that idea could be turned into something more generally useful). With the extra stress, the issue causes crashes in ILC that can be caught and examined in a debugger.
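The stress thread described above might look something like this; a sketch of the idea only, not the actual harness that was used, with made-up names and allocation sizes:

```csharp
using System;
using System.Threading;

static class GcStress
{
    // Background thread that churns the gen0 heap and forces frequent gen0
    // collections, to maximize the chance that objects move while another
    // thread is inside a small unprotected window.
    internal static void Start()
    {
        var thread = new Thread(() =>
        {
            var random = new Random();
            while (true)
            {
                // Allocate short-lived garbage so compaction has work to do.
                byte[] garbage = new byte[random.Next(16, 4096)];
                GC.KeepAlive(garbage);
                GC.Collect(0);
            }
        });
        thread.IsBackground = true;
        thread.Start();
    }
}
```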
Plausible scenario for how this issue may cause the failure:
Note: if there is no preceding object, updating our byref will likely crash instead, while figuring out how far the containing object has moved. And if the previous object does not move in this GC, everything will work OK; that is why the repro is infrequent.

As for why various asserts did not catch this earlier: when compiled as part of a Debug build, the actual ILC code would have asserts, but the CorLib portion of it would not. When, as the next step, we build tests using the newly built Debug ILC, we will only notice at the time of releasing the ResourceReader lock that the lock was not locked. (Although, deceivingly, there are asserts in the code that check this earlier.)
I'll try introducing such an intrinsic. As a backup, I have a more scoped solution as described in #104340 (comment). But I agree, a dedicated intrinsic is the more robust fix.
It looks like nothing needs to be changed in the JIT; at least I could not find what could be changed. Once the pointer-taking CompareExchange overload is introduced, the JIT emits the same expansion for it as for the byref-based one, modulo GC tracking of the ref, which is exactly what we want. That is without any changes to the JIT.
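The shape of such a pointer-taking overload might be as follows. This is a hypothetical signature for illustration; the actual overload added to the runtime may differ:

```csharp
using System.Threading;

static class InterlockedSketch
{
    // Like Interlocked.CompareExchange(ref int, int, int), but the location is
    // an unmanaged pointer that the GC does not track. The caller must
    // guarantee the target cannot move while the operation is in flight
    // (for example because no GC can interrupt the caller at that point).
    // In the runtime this method would be marked as a JIT intrinsic so that
    // the call site expands to a bare `lock cmpxchg`.
    internal static unsafe int CompareExchange(int* location, int value, int comparand)
        => Interlocked.CompareExchange(ref *location, value, comparand); // fallback body
}
```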
Hit this again in #107811 |
The ILC failure cannot go away until the toolset is updated and picks up an ILC that is not affected by the bug. I guess it has not been updated yet? @MichalStrehovsky
Ah, could be another one of those "we're waiting for RC1 to become the build toolset" things... |
yeah believe it should be updated next week now that RC1 is released. |
This is fixed in RC2. RC1 does not have the fix for this one. |
oh yes, then not for another few weeks :(
Intermittent `Object synchronization method was called from an unsynchronized block of code.` exceptions.

Known Issue Error Message
Fill the error message using step by step known issues guidance.

Known issue validation
Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=752522
⚠️ Validation could not be done without an Azure DevOps build URL on the issue. Please add it to the "Build: 🔎" line.
Result validation:
Validation performed at: 7/3/2024 2:40:39 AM UTC