Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash on arm/arm64 with mono due to missing memory barriers #24932

Closed
BrzVlad opened this issue Feb 19, 2018 · 14 comments
Closed

Crash on arm/arm64 with mono due to missing memory barriers #24932

BrzVlad opened this issue Feb 19, 2018 · 14 comments

Comments

@BrzVlad
Copy link
Member

BrzVlad commented Feb 19, 2018

Version Used:

2.6.0. Doesn't matter

Steps to Reproduce:

I've seen random roslyn crashes / compilation errors with mono ever since switching to it for building the BCL (on arm/arm64 linuxes)
For reproduction you just need to build mono, but the bug is extremely evasive. I have reliably reproduced it on a scaleway.com arm64 vm and also on a compulab utilite2. I can provide access to a machine where it should be easy to reproduce.

Expected Behavior:

Don't crash

Issue:

As far as I've investigated the crashes and my small understanding of object pools, it looks like missing memory barriers here https://github.com/dotnet/roslyn/blob/master/src/Dependencies/PooledObjects/ObjectPool%601.cs#L194

The thread that owns the object and puts it back for reuse doesn't use any synchronisation primitives, so the thread that will reuse this freed object is not guaranteed to see the correct state of the freed object.

@marek-safar
Copy link
Contributor

@jaredpar who owns this code?

@jaredpar
Copy link
Member

@marek-safar this is owned by the compiler. Will let @gafter, @tmat, @VSadov comment on this.

@jaredpar jaredpar added this to the 15.7 milestone Feb 19, 2018
@akoeplinger
Copy link
Member

akoeplinger commented Feb 19, 2018

Making _firstItem volatile fixes the crash (confirmed on the Scaleway Cavium ThunderX box):

-        private T _firstItem;
+        private volatile T _firstItem;

@VSadov
Copy link
Member

VSadov commented Feb 19, 2018

The reusing thread will have to do Interlocked.CompareExchange to check out the instance. That is a full fence.
If the pool did not syncronize, it would be broken on any platform.

Volatile may effect timing. It could get pretty expensive on ARM.

My guess, it is a race somewhere else and making things slower “fixes” them.

@VSadov
Copy link
Member

VSadov commented Feb 19, 2018

Reference write should be a “release” anyways. Interlocked is used here because of object uniqueness.

If reference writes to heap are not releases, a lot of things would be broken.

Roslyn assumes CLR2.0 memory model. - at least tries to. Since we test on x86/64 only, we cant be sure it there are no accidental strong ordering dependencies.

It could be one of those.

Or just a race which we do not see on other platforms due to timing.

@BrzVlad
Copy link
Member Author

BrzVlad commented Feb 19, 2018

@VSadov It doesn't matter if the reusing thread does a CAS and adds a memory barrier, since the memory barrier orders only on the calling thread. The stores on the thread that released the object can still be committed in any order.

@VSadov
Copy link
Member

VSadov commented Feb 19, 2018

One of the best explanation of CLR2.0 memory ordering that I know is http://joeduffyblog.com/2007/11/10/clr-20-memory-model/

It does not mention threads. In that model a fence is a fence and effects are global.

It does, however, mention that the description is a practical rationalization of the actual implementation and it is unclear whether Mono provides the same guarantees.
So there is a chance that we are observing platform/JIT differences.

@VSadov
Copy link
Member

VSadov commented Feb 19, 2018

It still seems more likely that we just have an ordinary race somewhere...

@jaredpar jaredpar modified the milestones: 15.7, 15.8 Mar 23, 2018
@jaredpar jaredpar modified the milestones: 15.8, 16.0 Jul 11, 2018
@xoofx
Copy link
Member

xoofx commented Dec 14, 2018

FYI, just to bump this, related to a compilation problem I have when running mono msbuild and Roslyn on an Raspberry ARM64, mono+Roslyn is giving the following exception:

     System.NullReferenceException: Object reference not set to an instance of an object
          at Microsoft.CodeAnalysis.CSharp.Symbols.SourceAssemblySymbol.ComputeStrongNameKeys () <0xffff85aae7f8 + 0x000e8> in <b7d368a2667f4dca818a47370ae17a21#b0cfdf5656c3c230b3b1cf9038b91924>:0
          at Microsoft.CodeAnalysis.CSharp.Symbols.SourceAssemblySymbol.get_StrongNameKeys () <0xffff85aae994 + 0x00087> in <b7d368a2667f4dca818a47370ae17a21#b0cfdf5656c3c230b3b1cf9038b91924>:0
          at Microsoft.CodeAnalysis.CSharp.Symbols.SourceAssemblySymbol.ValidateAttributeSemantics (Microsoft.CodeAnalysis.DiagnosticBag diagnostics) <0xffff85aaec40 + 0x00023> in <b7d368a2667f4dca818a47370ae17a21#b0cfdf5656c3c230b3b1cf9038b91924>:0
          at Microsoft.CodeAnalysis.CSharp.Symbols.SourceAssemblySymbol.ForceComplete (Microsoft.CodeAnalysis.SourceLocation locationOpt, System.Threading.CancellationToken cancellationToken) <0xffff85aafcf0 + 0x000e7> in <b7d368a2667f4dca818a47370ae17a21#b0cfdf5656c3c230b3b1cf9038b91924>:0

Which seems to be related to this issue https://bugzilla.xamarin.com/show_bug.cgi?id=56546#c2

I don't see how to pass the /parallel- switch through msbuild Csc task, if anyone has a clue.

For what is worth, dotnet 3.0 is running fine with csc, but the fact is that it is running super slow on ARM64, so the concurrency bug might not be visible with it

@VSadov
Copy link
Member

VSadov commented Dec 14, 2018

FWIW - the only reasonably testable memory model at the time was CLR2.0, so Roslyn targets that. It was an explicit decision. (discussed in meetings and such).
In particular reference stores are assumed to have release semantics (as in std::memory_order_release).

Roslyn has a lot of shared state - basically embraces it.
You will see variations of the following pattern all over the code base:

// naked store!
//  `stlr` is assumed  (or `dmb ish; str`, on v7)
maybe_read_in_other_threads = new Something();

If the store-release assumption above is weakened, a good deal of codebase may need to be reviewed and carefully volatile-ized.
It should be doable, but could be tedious.

@CoffeeFlux
Copy link

Pinging this thread again as this is still a pain point for Mono CI. Most of the random CI failures I get are from CodeAnalysis on ARM, and so I end up re-running lanes a whole bunch as a result. I believe we're working on a hack on our end to minimize the issue, but wanted to see if there's any chance of this being addressed upstream.

@jaredpar
Copy link
Member

jaredpar commented Jun 7, 2019

The Roslyn compiler, and the CoreFX libraries we depend on, are designed for the CLR memory model. Moving it to support the Mono / ECMA memory model is a non-trivial task. It's on the order of one to two months to get it done. This is not to say we should or should not do it, just that the work is much more significant than this thread makes it out to be. It's the equivalent of a medium sized language feature.

Stepping back though I'm not sure changing Roslyn is the right approach here given our .NET 5 plans. The idea is to make it so customers can use Mono as a runtime for their .NET applications. Are we suggesting that all customers need to make the same memory model adjustments as Roslyn is being asked to do? Or should we be looking to get Mono in line with the CLR memory model here?

@jinujoseph jinujoseph modified the milestones: 16.0, 16.3 Jun 9, 2019
@jcouv jcouv modified the milestones: 16.3, Compiler.Next Jul 11, 2019
@selalerer
Copy link

I am not sure it is related to this issue but we try to use CSharpCompilation.Emit() on ARM64 and sometimes our process crash. Once at least it outputted this error:

error CS8113: Invalid hash algorithm name: 'System.Byte[]'

The same code works consistently on Windows.

@BrzVlad
Copy link
Member Author

BrzVlad commented Feb 3, 2022

Doesn't look like this issue should be open. We ended up adding barriers on the runtime side on mono.

@BrzVlad BrzVlad closed this as completed Feb 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

10 participants