
Infinite spin lock in Encoding.GetEncoding #33383

Closed
filipnavara opened this issue Mar 9, 2020 · 13 comments
Labels
area-System.Threading · tenet-reliability (Reliability/stability related issue: stress, load problems, etc.) · untriaged (New issue has not been triaged by the area owner)

Comments

@filipnavara
Member

We've encountered a weird deadlock/livelock on .NET Core 3.1 and captured a memory dump of it. The dump may contain sensitive information, so I am not comfortable sharing it publicly, but I will be happy to share it privately or extract specific information from it.

Three threads are racing on the same spin lock, which is marked as locked, but there doesn't seem to be any thread actually holding it.

Stack traces for the threads look like this:

Thread 18060:

 	System.Private.CoreLib.dll!System.Threading.ReaderWriterLockSlim.SpinLock.EnterSpin(System.Threading.ReaderWriterLockSlim.EnterSpinLockReason reason) Line 1606	C#
 	System.Private.CoreLib.dll!System.Threading.ReaderWriterLockSlim.TryEnterUpgradeableReadLockCore(System.Threading.ReaderWriterLockSlim.TimeoutTracker timeout) Line 681	C#
 	System.Text.Encoding.CodePages.dll!System.Text.EncodingTable.GetCodePageFromName(string name) Line 26	C#
 	System.Text.Encoding.CodePages.dll!System.Text.CodePagesEncodingProvider.GetEncoding(string name) Line 87	C#
 	System.Private.CoreLib.dll!System.Text.EncodingProvider.GetEncodingFromProvider(string encodingName) Line 94	C#
 	System.Private.CoreLib.dll!System.Text.Encoding.GetEncoding(string name) Line 315	C#
 	MailClient.Mail.dll!MailClient.Mail.MailHeaderDictionary.MailHeaderDictionary(System.IO.Stream stream)	Unknown

looping with spinIndex = 556840848.

Thread 22080:

>	System.Private.CoreLib.dll!System.Threading.ReaderWriterLockSlim.SpinLock.EnterSpin(System.Threading.ReaderWriterLockSlim.EnterSpinLockReason reason) Line 1608	C#
 	System.Private.CoreLib.dll!System.Threading.ReaderWriterLockSlim.TryEnterUpgradeableReadLockCore(System.Threading.ReaderWriterLockSlim.TimeoutTracker timeout) Line 681	C#
 	System.Text.Encoding.CodePages.dll!System.Text.EncodingTable.GetCodePageFromName(string name) Line 26	C#
 	System.Text.Encoding.CodePages.dll!System.Text.CodePagesEncodingProvider.GetEncoding(string name) Line 87	C#
 	System.Private.CoreLib.dll!System.Text.EncodingProvider.GetEncodingFromProvider(string encodingName) Line 94	C#
 	System.Net.Mail.dll!System.Net.Mail.MailAddress.MailAddress(string address, string displayName, System.Text.Encoding displayNameEncoding) Line 76	C#
 	System.Net.Mail.dll!System.Net.Mail.MailAddress.MailAddress(string address) Line 46	C#

looping with spinIndex = 0.

Thread 23232:

 	System.Private.CoreLib.dll!System.Threading.ReaderWriterLockSlim.SpinLock.EnterSpin(System.Threading.ReaderWriterLockSlim.EnterSpinLockReason reason) Line 1606	C#
 	System.Private.CoreLib.dll!System.Threading.ReaderWriterLockSlim.TryEnterUpgradeableReadLockCore(System.Threading.ReaderWriterLockSlim.TimeoutTracker timeout) Line 681	C#
 	System.Text.Encoding.CodePages.dll!System.Text.EncodingTable.GetCodePageFromName(string name) Line 26	C#
 	System.Text.Encoding.CodePages.dll!System.Text.CodePagesEncodingProvider.GetEncoding(string name) Line 87	C#
 	System.Private.CoreLib.dll!System.Text.EncodingProvider.GetEncodingFromProvider(string encodingName) Line 94	C#
>	System.Private.CoreLib.dll!System.Text.Encoding.GetEncoding(string name) Line 315	C#
 	System.Net.Http.dll!System.Net.Http.HttpContent.ReadBufferAsString(System.ArraySegment<byte> buffer, System.Net.Http.Headers.HttpContentHeaders headers)	Unknown
 	System.Net.Http.dll!System.Net.Http.HttpContent.ReadBufferedContentAsString()	Unknown
 	System.Net.Http.dll!System.Net.Http.HttpContent.ReadAsStringAsync.AnonymousMethod__36_0(System.Net.Http.HttpContent s)	Unknown
 	System.Net.Http.dll!System.Net.Http.HttpContent.WaitAndReturnAsync<System.Net.Http.HttpContent, string>(System.Threading.Tasks.Task waitTask, System.Net.Http.HttpContent state, System.Func<System.Net.Http.HttpContent, string> returnFunc)	Unknown
 	System.Net.Http.dll!System.Net.Http.HttpContent.ReadAsStringAsync()	Unknown

looping with spinIndex = 552403664.

EncodingTable.s_cacheLock object looks like this:

Name                         Value        Type
CurrentReadCount             0            int
HasNoWaiters                 true         bool
IsReadLockHeld               false        bool
IsUpgradeableReadLockHeld    false        bool
IsWriteLockHeld              false        bool
RecursionPolicy              NoRecursion  System.Threading.LockRecursionPolicy
RecursiveReadCount           0            int
RecursiveUpgradeCount        0            int
RecursiveWriteCount          0            int
WaitingReadCount             0            int
WaitingUpgradeCount          0            int
WaitingWriteCount            0            int
_fDisposed                   false        bool
_fIsReentrant                false        bool
_fUpgradeThreadHoldingRead   false        bool
_lockID                      2            long
_numReadWaiters              0            uint
_numUpgradeWaiters           0            uint
_numWriteUpgradeWaiters      0            uint
_numWriteWaiters             0            uint
_owners                      0            uint
_readEvent                   null         System.Threading.EventWaitHandle
_spinLock                    {System.Threading.ReaderWriterLockSlim.SpinLock}  System.Threading.ReaderWriterLockSlim.SpinLock
_upgradeEvent                null         System.Threading.EventWaitHandle
_upgradeLockOwnerId          -1           int
_waitUpgradeEvent            null         System.Threading.EventWaitHandle
_waiterStates                NoWaiters    System.Threading.ReaderWriterLockSlim.WaiterStates
_writeEvent                  null         System.Threading.EventWaitHandle
_writeLockOwnerId            -1           int

It seems that the lock object is not held by anything, yet three threads are spinning on it anyway.

For completeness, this is how the SpinLock object looks:

Name                                      Value  Type
EnterForEnterAnyReadDeprioritizedCount    0      ushort
EnterForEnterAnyWriteDeprioritizedCount   0      ushort
_enterDeprioritizationState               0      int
_isLocked                                 1      int

@Dotnet-GitSync-Bot
Collaborator

I couldn't add an area label to this Issue.

Check out this page to find out which area owner to ping, or please add exactly one area label to help train me in the future.

@Dotnet-GitSync-Bot added the untriaged label Mar 9, 2020
@filipnavara
Member Author

/cc @kouvel (not sure if the spinning in SpinLock is intended to go this high)

@jkotas added the area-System.Threading and tenet-reliability labels Mar 9, 2020
@filipnavara
Member Author

filipnavara commented Mar 9, 2020

I cannot really confirm it yet, but I am starting to suspect that thread 22080 actually got past the TryEnter and is the one holding the SpinLock. At that point it probably lost its time quantum, and the two other spinning threads are causing some abnormal scheduling behavior that never lets it run again. There are around 100 threads in total when the application reaches this point, so the other threads could be contributing too.

We have a machine where this issue seems to be triggered a couple of times a day, but not reliably. It wasn't triggered by simply running Encoding.GetEncoding in parallel on multiple threads.
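
For reference, a repro attempt of that kind might look roughly like the following (a minimal sketch; the encoding names and iteration count are placeholders, not what the application actually uses):

```csharp
using System.Text;
using System.Threading.Tasks;

class GetEncodingStress
{
    static void Main()
    {
        // Route name lookups through System.Text.Encoding.CodePages, as in
        // the stack traces above, so EncodingTable.GetCodePageFromName and
        // its ReaderWriterLockSlim are exercised.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Hammer Encoding.GetEncoding from many threads; this alone did not
        // reproduce the hang.
        Parallel.For(0, 1_000_000, _ =>
        {
            Encoding.GetEncoding("windows-1250");
            Encoding.GetEncoding("iso-8859-2");
        });
    }
}
```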

Update: The theory about thread 22080 is probably bogus. The disassembly shows the following:

3119DE4C  cmp         dword ptr [esi],0   << EIP points here
3119DE4F  jne         3119DE78  
3119DE51  mov         edx,1  
3119DE56  xor         eax,eax  
3119DE58  lock cmpxchg dword ptr [esi],edx  

which comes from this C# code line:

if (_isLocked == 0 && TryEnter())

Hence, it didn't get past the TryEnter.
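
For context, the test-and-test-and-set pattern this maps to looks roughly like the following (a simplified sketch of the spin-lock shape, not the actual CoreLib source):

```csharp
using System.Threading;

// Simplified sketch of the spin-lock acquire path; names mirror the
// ReaderWriterLockSlim.SpinLock internals above, but details such as
// back-off and deprioritization are omitted.
internal struct SimpleSpinLock
{
    private int _isLocked; // 0 = free, 1 = held

    private bool TryEnter()
    {
        // The "lock cmpxchg dword ptr [esi],edx" above: atomically set
        // _isLocked to 1 only if it is currently 0.
        return Interlocked.CompareExchange(ref _isLocked, 1, 0) == 0;
    }

    public void Enter()
    {
        // The "cmp dword ptr [esi],0 / jne" is the cheap _isLocked == 0
        // pre-check before attempting the interlocked operation.
        while (!(_isLocked == 0 && TryEnter()))
        {
            Thread.SpinWait(1); // the real code spins/yields/sleeps with back-off
        }
    }

    public void Exit()
    {
        Volatile.Write(ref _isLocked, 0);
    }
}
```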

@kouvel
Member

kouvel commented Mar 9, 2020

ReaderWriterLockSlim is not hardened against thread aborts, or this might be related to #8710. It would be difficult to tell from a crash dump what happened; based on the state above, it looks like any info about the owner thread would already have been cleared. Are there any thread aborts, OOMs, or reentrancy happening in the scenario?

@filipnavara
Member Author

No thread aborts in the whole app; .NET Core doesn't offer any way to trigger them explicitly anyway. No reentrancy as far as I can tell. The only OutOfMemoryException in the dump is the one pre-allocated at startup, and it has no fields filled in. The one run where we captured it under a debugger didn't show one either.

@filipnavara
Member Author

As far as reentrancy goes, I cannot completely rule it out on the UI thread. It's not happening at the particular moment the dump was captured, and there are no references to the ReaderWriterLockSlim on threads other than the spinning ones. However, some reentrancy could have happened earlier in the run on the UI thread. I'd have to look into the implications and whether it could possibly result in a scenario where the spin lock looks locked.

@kouvel
Member

kouvel commented Mar 9, 2020

At the moment I don't see how it would get into that state due to reentrancy, as the spin-lock is released before any wait operation that could result in reentrancy.

We have a machine where this issue seems to be triggered a couple of times a day, but not reliably.

Since it appears to be happening fairly frequently, an option might be to add some logging to the enter/exit paths of the spin-waits, recording the managed thread ID, and on a failure look for an enter without a matching exit.

It looks like the implementation hasn't changed significantly since 2.1 based on the history. Is the scenario new, or has there been enough coverage on 2.x to indicate that this is a new problem in 3.x?

@filipnavara
Member Author

We migrated our application from .NET Framework, where the Encoding code paths were significantly different from .NET Core. We went straight to 3.0/3.1 since it's a desktop app, which wasn't supported earlier. Unfortunately I don't have any data to show whether this problem was already present earlier or not.

I can try to add some custom logging, but it will require building a custom runtime, so it will definitely take me a couple of days (especially building 3.1 from source).

@kouvel
Member

kouvel commented Mar 9, 2020

I think logging would at least provide more info (provided a repro is found with logs), as I don't see a way to make progress with the current info. What I have done before in a somewhat similar case is to sync to the exact commit in the dotnet/coreclr repo that matches the commit hash of the coreclr.dll from the runtime being used, and then (using an xcopy-able runtime) update only System.Private.CoreLib.dll with additional logging. If using Console, it may be necessary to reflect into it for logging, or do in-memory logging and dump only the interesting logs at some point.
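
One possible shape for that in-memory logging (a hypothetical sketch; the type, buffer size, and what gets recorded are all made up for illustration):

```csharp
using System;
using System.Threading;

// Hypothetical ring buffer for enter/exit events, so logging doesn't perturb
// timing the way Console output would. The buffer can be inspected later from
// a dump, or flushed when a missing exit is detected.
internal static class SpinLockLog
{
    private struct Entry
    {
        public int ThreadId;
        public long Timestamp;
        public string Op; // e.g. "Enter", "Exit"
    }

    private const int Capacity = 1 << 16; // power of two for cheap wrap-around
    private static readonly Entry[] s_entries = new Entry[Capacity];
    private static int s_next = -1;

    public static void Record(string op)
    {
        int index = Interlocked.Increment(ref s_next) & (Capacity - 1);
        s_entries[index] = new Entry
        {
            ThreadId = Environment.CurrentManagedThreadId,
            Timestamp = Environment.TickCount64,
            Op = op
        };
    }
}
```

Calls like SpinLockLog.Record("Enter") / SpinLockLog.Record("Exit") would then sit next to the spin-lock enter/exit paths in the patched System.Private.CoreLib.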

@filipnavara
Member Author

I'm thinking about changing ReaderWriterLockSlim.SpinLock._isLocked to actually track the thread id that locked it. That way I should be able to preserve most of the logic and have the most up-to-date information. I'll give it a try tomorrow.
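
Roughly what I have in mind (a sketch only; the real SpinLock also carries deprioritization state and back-off that are left out here):

```csharp
using System;
using System.Threading;

// Sketch of a spin lock whose lock word stores the managed thread ID of the
// owner instead of just 0/1, so a dump shows which thread last acquired it.
internal struct OwnerTrackingSpinLock
{
    private int _ownerThreadId; // 0 = free, otherwise the owner's managed thread ID

    public bool TryEnter()
    {
        int me = Environment.CurrentManagedThreadId; // always >= 1, so 0 is safe as "free"
        return Interlocked.CompareExchange(ref _ownerThreadId, me, 0) == 0;
    }

    public void Exit()
    {
        // Could also assert _ownerThreadId == Environment.CurrentManagedThreadId here.
        Volatile.Write(ref _ownerThreadId, 0);
    }
}
```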

Thanks for the idea of just replacing CoreLib. I've messed that up before because I need to match the correct build configuration to the runtime (release/debug/checked), since they are not interchangeable. Hopefully I won't mess it up this time.

@kouvel
Member

kouvel commented Mar 9, 2020

I'm thinking about changing ReaderWriterLockSlim.SpinLock._isLocked to actually track the thread id that locked it.

That would hopefully point to the thread, but by the time of failure it would be too late and the thread's stack would be long past the issue. It would definitely be more info; it would be interesting to see stack traces for an enter without a matching exit, although that may make it more difficult to repro if it's a timing issue. I'm sure there would be ways to progress from there, like adding try/finally's to track when there is a missing exit and logging the stack trace when that happens.

@kouvel
Member

kouvel commented Mar 9, 2020

Or maybe try/catch-all+rethrow/finallys to ensure that the spin lock is not held by the current thread by the end of any entry point, with logs in the catch-all and finally. That way there should hopefully be no way for the logs to miss a more actionable cause for the issue.
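
As a sketch of that shape (hypothetical helper names; the real entry points are the ReaderWriterLockSlim public methods, and the held-check would have to reach into its internals):

```csharp
using System;
using System.Diagnostics;

// Hypothetical guard around a lock entry point: log (with a stack trace) if
// the internal spin lock is still held by the current thread when the entry
// point throws or returns.
internal static class LockDiagnostics
{
    public static void GuardEntryPoint(Action entryPoint, Func<bool> spinLockHeldByCurrentThread)
    {
        try
        {
            entryPoint();
        }
        catch (Exception ex)
        {
            if (spinLockHeldByCurrentThread())
                Debug.WriteLine($"Spin lock leaked on exception: {ex}\n{Environment.StackTrace}");
            throw;
        }
        finally
        {
            if (spinLockHeldByCurrentThread())
                Debug.WriteLine($"Spin lock still held at exit:\n{Environment.StackTrace}");
        }
    }
}
```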

@filipnavara
Member Author

We were unable to reproduce the issue again, even on the same machine. Going to close this for now. Thanks for the help and ideas!

@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020