-
Notifications
You must be signed in to change notification settings - Fork 4.9k
Socket: improve cross-platform behavior on Dispose #38804
Conversation
I'm first letting the tests of #37486 run on CI to ensure they are asserting current corefx behaviour on Windows. |
CI passes on Windows for all test except |
Most of the CI test failures that we saw with the original PR were from Outerloop. |
/azp run |
Azure Pipelines successfully started running 4 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 4 pipeline(s). |
CI failures:
|
@davidsh @stephentoub @krwq @wfurt maybe we can merge this after a few more CI runs? |
src/System.Net.Sockets/src/System/Net/Sockets/AcceptOverlappedAsyncResult.Windows.cs
Outdated
Show resolved
Hide resolved
src/System.Net.Sockets/src/System/Net/Sockets/SafeSocketHandle.Unix.cs
Outdated
Show resolved
Hide resolved
src/System.Net.Sockets/src/System/Net/Sockets/SafeSocketHandle.Unix.cs
Outdated
Show resolved
Hide resolved
src/System.Net.Sockets/src/System/Net/Sockets/SafeSocketHandle.Unix.cs
Outdated
Show resolved
Hide resolved
// What we do here isn't specified by POSIX and doesn't work on all OSes. | ||
// On Linux this works well. | ||
// On OSX, TCP connections will be closed with a FIN close instead of an abortive RST close. | ||
// And, pending TCP connect operations and UDP receive are not abortable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this going to make things worse than they are today? With this change, Dispose'ing of the associated SafeHandle will never unblock a synchronous Close/Accept, because the SafeHandle will have been add-ref'd. Might it have before this change, even if it was racy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are no regressions. I've opened #39079 to check how tests behaved before Socket changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are seeing some issues with this change in Kestrel in the following test: https://github.com/aspnet/AspNetCore/blob/3a5dbad3bcf66750000943194f7270a2df3cc33d/src/Servers/Kestrel/test/FunctionalTests/ResponseTests.cs#L905. This is only failing on linux.
I'm not sure what specifically in this change caused the regression, but we saw this test fail ~30% of the time in Kestrel. More information here: dotnet/aspnetcore#13532
Here are the Kestrel logs we are seeing:
| [0.001s] TestLifetime Information: Starting test ClientCanReceiveFullConnectionCloseResponseWithoutErrorAtALowDataRate at 2019-09-09T09:05:06
| [0.068s] Microsoft.AspNetCore.Hosting.Diagnostics Debug: Hosting starting
| [0.071s] Microsoft.AspNetCore.Hosting.Diagnostics Debug: Hosting started
| [0.071s] Microsoft.AspNetCore.Hosting.Diagnostics Debug: Loaded hosting startup assembly Sockets.FunctionalTests, Version=42.42.42.42, Culture=neutral, PublicKeyToken=adb9793829ddae60
| [0.072s] Microsoft.AspNetCore.Server.Kestrel Debug: TestServer is listening on port 34479
| [0.074s] Microsoft.AspNetCore.Server.Kestrel Debug: Connection id "0HLPL0J6HPQ7C" accepted.
| [0.077s] Microsoft.AspNetCore.Server.Kestrel Debug: Connection id "0HLPL0J6HPQ7C" started.
| [0.078s] Microsoft.AspNetCore.Hosting.Diagnostics Information: Request starting HTTP/1.1 GET http:/// - -
| [11.186s] Microsoft.AspNetCore.Hosting.Diagnostics Information: Request finished HTTP/1.1 GET http:/// - - - 200 - - 11107.8383ms
| [11.187s] Microsoft.AspNetCore.Server.Kestrel Debug: Connection id "0HLPL0J6HPQ7C" disconnecting.
| [11.188s] Microsoft.AspNetCore.Server.Kestrel Debug: Connection id "0HLPL0J6HPQ7C" stopped.
| [11.188s] Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets Debug: Connection id "0HLPL0J6HPQ7C" sending FIN because: "The Socket transport's send loop completed gracefully."
| [14.193s] Microsoft.AspNetCore.Hosting.Diagnostics Debug: Hosting shutdown
| [14.886s] TestLifetime Information: Finished test ClientCanReceiveFullConnectionCloseResponseWithoutErrorAtALowDataRate in 14.885012699999999s
This also could be an issue with how we are handling socket shutdown. Here is the shutdown code in Kestrel: https://github.com/aspnet/AspNetCore/blob/master/src/Servers/Kestrel/Transport.Sockets/src/Internal/SocketConnection.cs#L337
@tmds can you give me a TL;DR on what changed here in Socket.Shutdown + Socket.Dispose on Unix?
cc @halter73
if (fdFlags == 0) | ||
{ | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question here. Is this possibly making things worse?
src/System.Net.Sockets/src/System/Net/Sockets/SafeSocketHandle.Unix.cs
Outdated
Show resolved
Hide resolved
src/System.Net.Sockets/src/System/Net/Sockets/SafeSocketHandle.Windows.cs
Show resolved
Hide resolved
We currently have a 10% regression that we’re chasing down. So I’d wait on this if we’re. Worried it could regress even more |
@stephentoub thank you for reviewing! I've addressed comments. For two open questions:
I ran tests of the PR without the other changes in #38804. The tests I'm most interested in (ConnectGetsCanceledByDispose, UdpReceiveGetsCanceledByDispose on macOS) show neither a pass/fail status: https://mc.dot.net/#/user/dotnet-bot/pr~2Fdotnet~2Fcorefx~2Frefs~2Fpull~2F39079~2Fmerge/test~2Ffunctional~2Fcli~2Finnerloop~2F/20190702.1/workItem/System.Net.Sockets.Tests. I'm not sure what the CI result means then, but it was not passing.
This is to avoid changing non-CLOEXEC sockets that may be shared with other Processes. I've augmented this to skip the check when the close is abortive (like when the Dispose happens from the finalizer). |
Actually I believe it was. For infrastructure reasons, passing data is at least temporarily not being stored. And if you look at the infrastructure log, you'll see the full xunit output, which shows:
|
Thanks. The Connect and UdpReceive tests were skipped on macOS. I've included them now and another CI build is in progress: #39079. |
No regression on macOS for {Connect,UdpReceive}GetsCanceledByDispose. Pre-PR:
|
@geoffkizer @davidfowl can you please look into this PR? A previous attempt was reverted after being merged just before preview 7. It would be good if this gets some more time on master. |
Thanks, Tom. At this point I think it's going to miss 3.0, and we'll check it in first thing after master branches. The bar is high right now, and we just don't have any more runway to deal with fallout; while I really want this change in, it has the potential to be destabilizing, have unintended perf regressions, etc. |
* Add tests * Don't skip SyncSendFileGetsCanceledByDispose on .NET Framework * Implement using code from previous PR * Update SocketErrors when CleanedUp for APM APIs * Cleanup SO_TYPE socket length handling (Fixes https://github.com/dotnet/corefx/issues/38750) * SyncSendFileGetsCanceledByDispose: remove SkipTargetFramework NetFramework * PR feedback * Fix comment typo * Also wait for async operations, and wait longer on each retry * clear sockaddr using memset * Fix build * Only perform CLOEXEC check for non-abortive close Commit migrated from dotnet/corefx@9b2f7a8
Dispose
CancelIoEx
SafeHandle
instead ofIntPtr
(needed on Windows forCancelIoEx
unblocking)SocketError
for aborted operationsbased on #37486
fixes #22564
fixes #26034
CC @wfurt @stephentoub @geoffkizer @davidsh @krwq