Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SafeSocketHandle.CloseAsIs hanging in finalizer thread #40301

Closed
jaredpar opened this issue Aug 4, 2020 · 22 comments · Fixed by #41508
Closed

SafeSocketHandle.CloseAsIs hanging in finalizer thread #40301

jaredpar opened this issue Aug 4, 2020 · 22 comments · Fixed by #41508
Assignees
Labels
area-System.Net.Sockets bug tenet-reliability Reliability/stability related issue (stress, load problems, etc.)
Milestone

Comments

@jaredpar
Copy link
Member

jaredpar commented Aug 4, 2020

This issue comes from investigating test failures on this Roslyn PR dotnet/roslyn#46510

At the conclusion of running the unit tests for VBCSCompiler server the xUnit process will refuse to exit. The xUnit output will indicate that the tests have completed running but the process itself will not exit. Attaching the debugger to the xUnit process and there are two threads of note that are still running:

GC Finalizer

System.Private.CoreLib.dll!System.Threading.SpinWait.SpinOnceCore(int sleep1Threshold) (Unknown Source:0)
System.Net.Sockets.dll!System.Net.Sockets.SafeSocketHandle.CloseAsIs(bool abortive) (Unknown Source:0)
System.Net.Sockets.dll!System.Net.Sockets.Socket.Dispose(bool disposing) (Unknown Source:0)
System.Net.Sockets.dll!System.Net.Sockets.Socket.~Socket() (Unknown Source:0)
[Native to Managed Transition] (Unknown Source:0)

.NET Sockets

System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEngine.EventLoop() (Unknown Source:0)
System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEngine..ctor.AnonymousMethod__14_0(object s) (Unknown Source:0)
System.Private.CoreLib.dll!System.Threading.ThreadHelper.ThreadStart(object obj) (Unknown Source:0)
[Native to Managed Transition] (Unknown Source:0)

The VBCSCompiler server makes heavy use of named pipes. Looking through the Socket on the finalizer thread I can confirm it's a Unix Domain socket related to the named pipes the compiler is creating (the path in the end point matches the paths we create in the tests).

Unfortunately after a day of debugging I have not been able to narrow this problem down any further:

  1. The problem only repros when running the entire test assembly. I ran each test class in the assembly and none of them individually reproduce the issue. Have to run them as a group.
  2. Went through a cycle of causing the hang, identifying the test which created the socket that was hung in the finalizer, disable that test, re-run the assembly. Did this for about five different tests and it had no impact on the hang.
  3. Thought this may be related to WaitForConnectionAsync does not respond to cancellation #40289 so I tried aggressively calling NamedPipeServerStream.Dispose on any instance that was hung in a WaitForConnectionAsync call.

None of these has had any impact though. I've also been unsuccessful in constructing a more concise repro. 😦

More than happy to provide any info to make tracking this down easier.

Repro Information:

  • Runtime: .NET 5 Preview 7
  • OS: Ubuntu 18.04, Ubuntu 18.04 via WSL2, OSX
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Net.Sockets untriaged New issue has not been triaged by the area owner labels Aug 4, 2020
@ghost
Copy link

ghost commented Aug 4, 2020

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

@stephentoub
Copy link
Member

cc: @tmds

@scalablecory scalablecory added this to the 5.0.0 milestone Aug 6, 2020
@scalablecory scalablecory removed the untriaged New issue has not been triaged by the area owner label Aug 6, 2020
@antonfirsov antonfirsov self-assigned this Aug 6, 2020
jaredpar added a commit to jaredpar/roslyn that referenced this issue Aug 6, 2020
@jaredpar
Copy link
Member Author

jaredpar commented Aug 6, 2020

Did my best to narrow this down to a smaller problem. Created a branch where you can see the delta between our unit tests passing and hanging. This commit shows that it's simply enabling one new test to run on Linux that causes this hang.

To repro this do the following:

> git clone https://github.com/jaredpar/roslyn -o jaredpar
> git checkout -B repro jaredpar/repro/pipe-hang
> cd src/Compilers/Server/VBCSCompilerServerTests
> dotnet build
> dotnet msbuild /t:Test

Couple notes:

  1. About 1/2 of the time this will cause the tests to fail, the other half it will hang in the finalizer
  2. Running the new test alone is not sufficient to demonstrate the hang, have to run the entire suite (ensure why).

A brief description of what this particular test is doing:

  • Opening four NamedPipeServerStream instances on the same pipe name
  • Creates one NamedPipeClientStream instance that fully connects
  • Dispose three of the NamedPipeServerStream instances, does not dispose the connected one
  • Disposes the NamedPipeClientStream
  • That means after the test completes one of the NamedPipeServerStream instances remains undisposed and it appears this is the one that ends up stuck in the finalizer thread.

@tmds
Copy link
Member

tmds commented Aug 24, 2020

System.Private.CoreLib.dll!System.Threading.SpinWait.SpinOnceCore(int sleep1Threshold) (Unknown Source:0)
System.Net.Sockets.dll!System.Net.Sockets.SafeSocketHandle.CloseAsIs(bool abortive) (Unknown Source:0)

Since .NET 5, SafeHandles are ref'ed for on-going Socket operations (instead of calling DangerousGetHandle). If we are spinning here, there is a on-going operation against the Socket that keeps the handle from being released. It would be interesting to identify the operation that is on-going on some other thread. It may help put together a reproducer.

Aborting the on-going Socket operations is attempted by the TryUnblockSocket. If the Socket doesn't have CLOEXEC this method no-ops. This method is best-effort. There are tests that verify abortive behavior for UDP/TCP, but no tests for domain sockets.

private unsafe bool TryUnblockSocket(bool abortive)
{
// Calling 'close' on a socket that has pending blocking calls (e.g. recv, send, accept, ...)
// may block indefinitely. This is a best-effort attempt to not get blocked and make those operations return.
// We need to ensure we keep the expected TCP behavior that is observed by the socket peer (FIN vs RST close).
// What we do here isn't specified by POSIX and doesn't work on all OSes.
// On Linux this works well.
// On OSX, TCP connections will be closed with a FIN close instead of an abortive RST close.
// And, pending TCP connect operations and UDP receive are not abortable.
// Unless we're doing an abortive close, don't touch sockets which don't have the CLOEXEC flag set.
// These may be shared with other processes and we want to avoid disconnecting them.
if (!abortive)
{
int fdFlags = Interop.Sys.Fcntl.GetFD(handle);
if (fdFlags == 0)
{
return false;
}
}
int type = 0;
int optLen = sizeof(int);
Interop.Error err = Interop.Sys.GetSockOpt(handle, SocketOptionLevel.Socket, SocketOptionName.Type, (byte*)&type, &optLen);
if (err == Interop.Error.SUCCESS)
{
// For TCP (SocketType.Stream), perform an abortive close.
// Unless the user requested a normal close using Socket.Shutdown.
if (type == (int)SocketType.Stream && !_hasShutdownSend)
{
Interop.Sys.Disconnect(handle);
}
else
{
Interop.Sys.Shutdown(handle, SocketShutdown.Both);
}
}
return true;
}

Note that async Socket operations always nicely abort because there are no on-going Socket operations that keep the SafeHandle ref'ed.

@antonfirsov
Copy link
Member

antonfirsov commented Aug 24, 2020

It would be interesting to identify the operation that is on-going on some other thread.

I'm currently trying to figure that out, but my gut feel is that spinning in the finalizer thread is a bad idea in any case, and we should probably rethink the way CloseAsIs works, regardless of the outcome of the investigation.

@jaredpar
Copy link
Member Author

@tmds

It would be interesting to identify the operation that is on-going on some other thread. It may help put together a reproducer.

Really wish I could've found a simpler repro here that would more clearly identify the operation occurring. Spent several hours on this but was unable to do so.

More than happy to help out with any questions on the current repro.

@antonfirsov
Copy link
Member

antonfirsov commented Aug 26, 2020

Looks like the issue is not caused by outstanding blocking calls. I have a smaller repro now:
https://gist.github.com/antonfirsov/ce0cb4992e115bb4d6e8dd6862fd6780

Here is what is happening in my understanding:

  1. SafePipeHandle.Unix increases the reference counter on SocketSafeHandle:
    _namedPipeSocketHandle.DangerousAddRef(ref ignored);
  2. NamedPipeClientConnectionHost seems to leak (= not dispose) some NamedPipeServerStream instances. (@jaredpar I'm wondering if this only happens in the repro branch or is this a bug in the compiler server?)
  3. A Socket finalizer is called before the finalizers of the "owner" SafePipeHandle and NamedPipeServerStream.
  4. Since the reference count is not 0, SocketSafeHandle is not released. CloseAsIs will keep spinning, blocking the finalizer thread, preventing the finalization of SafePipeHandle (therefore the release of SocketSafeHandle):
    while (!_released)
    {
    // The socket was not released due to the SafeHandle being used.
    // Try to make those on-going calls return.
    // On Linux, TryUnblockSocket will unblock current operations but it doesn't prevent
    // a new one from starting. So we must call TryUnblockSocket multiple times.
    canceledOperations |= TryUnblockSocket(abortive);
    sw.SpinOnce();
    }

I have an idea for a PR that would remove the spinning.

For some reason the issue does not happen with 3.1.

@jaredpar
Copy link
Member Author

NamedPipeClientConnectionHost seems to leak (= not dispose) some NamedPipeServerStream instances. (@jaredpar I'm wondering if this only happens in the repro branch or is this a bug in the compiler server?)

This should not be happening in the product anymore. After filing the bug I was able to find a few other places these were leaking, made a push to get rid of them, and I believe I got them all at this point. It's possible it's still happening in the tests though.

@tmds
Copy link
Member

tmds commented Aug 26, 2020

Since the reference count is not 0, SocketSafeHandle is not released.

👍 Makes sense.

I have an idea for a PR that would remove the spinning.

The spinning deals with a race condition that occurs as follows:

  1. Thread A refs the SafeHandle
  2. Thread B calls TryUnblockSocket
  3. Thread A uses the handle for a Socket operation

If Thread B does not call TryUnblockSocket again (which it does now by spinning), the operation on thread A is not aborted.

@antonfirsov
Copy link
Member

@tmds I may miss something important but:
do we really need to abort the operation when we are doing an abortive close from a finalizer? Isn't it better to ignore it, and enforce the abortive close?

@tmds
Copy link
Member

tmds commented Aug 27, 2020

do we really need to abort the operation when we are doing an abortive close from a finalizer

When the Socket gets closed (by the user explicitly, or on the finalizer thread) we want all threads that are performing a blocking operation on the Socket to unblock (and return something like SocketError.OperationAborted).

@tmds
Copy link
Member

tmds commented Aug 27, 2020

Anton, I think we could dup the SafeSocketHandle so the SafePipeHandle has its own kernel-level reference. wdyt?

@antonfirsov
Copy link
Member

we want all threads that are performing a blocking operation on the Socket to unblock (and return something like SocketError.OperationAborted).

@tmds what will happen to those ongoing operations if we close the underlying handle?

I think we really need to distinguish between the finalizer and a normal Dispose. It already indicates a user bug, if a finalizer is being executed. Instead of being 100% correct handling blocking operations I would only aim for preventing handle leak in those cases.

I believe the way CloseAsIs works now in a very unsafe way:

  • We assume that !_released indicates an ongoing blocking operation. This is not always true.
  • Currently any user code relying on DangerousRelease in a finalizer is risking to block the finalizer thread (a fatal bug!)

I would prefer to find a way to solve this in Sockets instead of just changing the way NamedPipes are utilizing sockets. As an alternative, we can document that user code should never DangerousAddRef / DangerousRelease on socket.SafeHandle, but that sounds weird to me.

@tmds
Copy link
Member

tmds commented Aug 27, 2020

@tmds what will happen to those ongoing operations if we close the underlying handle?

With 3.1 on-going operations didn't ref the SafeHandle. With 5.0 they do, and as a consequence they prevent the SafeHandle from getting released (that is: actually closing). So we need something to unblock the operations.

I think we really need to distinguish between the finalizer and a normal Dispose

In case of a finalizer, we may assume there are no on-going operations, because those would prevent the SafeHandle from being finalized, right? So we may skip the TryUnblockSocket and prevent blocking the finalizer thread.

@antonfirsov
Copy link
Member

antonfirsov commented Aug 27, 2020

With 5.0 they do, and as a consequence they prevent the SafeHandle from getting released (that is: actually closing)

What I mean is that in a finalizer (instead of waiting for the handle to be released) we may want force-close the handle, independently of the ref-count/release mechanism.

In case of a finalizer, we may assume there are no on-going operations, because those would prevent the SafeHandle from being finalized, right?

Unfortunately, this is not true. In this particular case we are in the finalizer of the Socket instance, so ongoing P/Invoke-s have no effect on whether it's executed or not.

Edit: on the other hand, this should be a very rare corner-case. We expect blocking socket operations to be triggered from managed Socket calls, I think it's unlikely the Socket gets finalized in the meanwhile.

@antonfirsov
Copy link
Member

antonfirsov commented Aug 27, 2020

I just noticed this comment in TryUnblockSocket:

// Calling 'close' on a socket that has pending blocking calls (e.g. recv, send, accept, ...)
// may block indefinitely. This is a best-effort attempt to not get blocked and make those operations return.

What does "indefinitely" mean here? Is it only for the time the blocking socket call is executing? Is it guaranteed to eventually return? Since it's a rare corner-case, in a finalizer it might be acceptable to wait for a blocking call. (As an alternative of of keeping the behavior where SafeSocketHandle.DangerousAddRef is risking a finalizer thread hang.)

Edit: I found #37873 now, which answers my question. Still hoping we can find a way to remove the spinning instead of working around this in SafePipeHandle. It's not good that we don't distinguish between blocking syscalls, and other operations changing SocketSafeHandle's refcount.
For example, we can set/unset a flag on SafeSocketHandle when doing blocking syscalls.

@antonfirsov
Copy link
Member

antonfirsov commented Aug 27, 2020

~Socket() was a NOP before the following PR:
dotnet/corefx#38499

I wonder if the concern of indefinite spinning (or blocking close) was discussed in that PR? SafeSocketHandle.DangerousAddRef was probably safe before that.
/cc @wfurt

@antonfirsov
Copy link
Member

Just an illustration of the bigger issue we have here. It's this easy to create a finalizer thread hang now:

static void Main(string[] args)
{
    CreateSocket();
    Console.WriteLine("Dying!");
    GC.Collect();
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static void CreateSocket()
{
    Socket socket = new Socket(SocketType.Stream, ProtocolType.Tcp);
    bool dummy = false;
    socket.SafeHandle.DangerousAddRef(ref dummy);
}

@geoffkizer
Copy link
Contributor

This is definitely related to the fix we made in dotnet/corefx#38499. Before that we didn't really have a socket finalizer, so we would not have hit this issue.

That said, we cannot undo the change in dotnet/corefx#38499 (at least not without making other, extensive changes). The reasoning is explained here: #29327 (comment)

I think the best fix for 5.0 is to simply not try to cancel outstanding operations when the Socket finalizer is called, as was suggested above. It's possible that this could lead to some unexpected behavior in weird corner cases, but I think those corner cases are restricted to the case where you do a DangerousAddRef without a corresponding DangerousRelease. Since we are currently hanging the finalizer thread in that case, we can probably live with a few quirky corner cases for now.

I do think we should revisit all of this in the future. Ideally we shouldn't need to have a finalizer on Socket. But changing that would take some effort and research etc.

@tmds
Copy link
Member

tmds commented Aug 28, 2020

The spinning was introduced as part of dotnet/corefx#38804. In that PR, Socket PInvoke changed to use SafeHandles instead of DangerousAddRef.

Because previously there were no out-standing references, the handle was released as soon as the Socket got Closed. On Windows, this caused on-going operations to abort, but on Linux those operations would just keep on blocking.

Due to the use of SafeHandle, the handle release no longer occurs on Close when there are on-going operations. That became the responsibility of TryUnblockSocket.

The spinning was added to deal with the race described in #40301 (comment). But I did not consider someone other than Socket operations could DangerousAddRef, which leads to infinite spinning.

Based on what was said I think we are safe to not call TryUnblockSocket when Closing on a finalizer thread, because there should be no on-going Socket operations that need to be aborted.

@marklio marklio added the tenet-reliability Reliability/stability related issue (stress, load problems, etc.) label Sep 1, 2020
@jaredpar
Copy link
Member Author

jaredpar commented Sep 1, 2020

🎉

@karelz
Copy link
Member

karelz commented Sep 10, 2020

Fixed also in 5.0 in RC1 (PR #41688) and in RC2 (PR #41747)

@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Net.Sockets bug tenet-reliability Reliability/stability related issue (stress, load problems, etc.)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants