QUIC: MsQuicListener can randomly fail to accept new connection #55979
Comments
Tagging subscribers to this area: @dotnet/ncl

Issue Details

That can be obvious with #55642, however there are more hidden implications: we have test fragments like

```csharp
ValueTask clientTask = clientConnection.ConnectAsync();
using QuicConnection serverConnection = await listener.AcceptConnectionAsync();
await clientTask;
```

If the connect fails, the `AcceptConnectionAsync()` would hang forever since the `clientTask` is not awaited. There are other reasons why we hang and @stephentoub is helping with the other patterns.

As @ManickaP pointed out in #53224, https://github.com/microsoft/msquic/blob/3898b2b88085a478eeb844885ea78daa9b060d06/src/core/stream.c#L207-L211

```c
if (QuicConnIsClosed(Stream->Connection) ||
    Stream->Flags.Started) {
    Status = QUIC_STATUS_INVALID_STATE;
    goto Exit;
}
```

That check is QUIC protocol state. If we get a close from the peer, there is a race condition there. I correlated this with a packet capture and I can see that the failed stream/connection got a REFUSED message as well: test-refused.pcapng.zip

This is unpleasant as it is difficult to hook any retry logic to `INVALID_STATE`. There may be more to it, but this is partially caused by (MS?)QUIC design. I originally thought there may be some race condition starting the listener, but clearly the message is well-formed QUIC.

When I debug the failures, there is NO callback on the `MsQuicListener` and the logic happens inside msquic. @nibanks pointed me to `QuicWorkerIsOverloaded`, and when this is true the listener would refuse to take up more work, e.g. accept new connections. There may be other reasons why msquic would refuse a new connection. This is also unpleasant as our tests do stress the CPU and we are at a point where any test can impact any other test.
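To make the hang concrete, here is a minimal sketch reusing the `clientConnection` and `listener` names from the fragment above. The exact `QuicConnection`/`QuicListener` signatures have shifted between runtime versions, so treat this as an illustration of the pattern rather than the repo's actual fix: awaiting whichever side finishes first lets a failed connect surface as an exception instead of leaving the accept pending forever.

```csharp
// Sketch only: observe whichever operation completes first, so a failed
// ConnectAsync() throws here instead of leaving AcceptConnectionAsync() pending.
Task clientTask = clientConnection.ConnectAsync().AsTask();
Task<QuicConnection> acceptTask = listener.AcceptConnectionAsync().AsTask();

// WhenAny returns the first task to complete; awaiting it re-throws its exception.
await await Task.WhenAny(clientTask, acceptTask);

using QuicConnection serverConnection = await acceptTask;
await clientTask;
```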
|
Triage: Crank up the threshold limit higher for testing. |
Aside from the CPU check, in

```c
if (Connection->ReceiveQueueCount >= QUIC_MAX_RECEIVE_QUEUE_COUNT) {
    ..
    QuicPacketLogDrop(Connection, CxPlatDataPathRecvDataToRecvPacket(Datagram), "Max queue limit reached");
}
```

this may be related to existing tests - they are weird in the sense that we send a lot of 1-2 byte chunks, so it seems easy to overrun the 180 limit. I wish this were something we could tune up for the test runs @nibanks, but it seems to be a compile-time constant. |
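As an illustration of tuning on the test side rather than in msquic, the many 1-2 byte writes could be coalesced into larger buffers so far fewer packets land on the receiver's queue. The helper name and the 4 KB batch size below are made up for this sketch, not something from the repo:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Net.Quic;
using System.Threading.Tasks;

// Hypothetical test helper: buffer tiny chunks and flush them in larger writes,
// reducing the number of packets queued on the receiving msquic worker.
static async Task WriteCoalescedAsync(QuicStream stream, IEnumerable<byte[]> chunks, int batchSize = 4096)
{
    using var buffer = new MemoryStream();
    foreach (byte[] chunk in chunks)
    {
        buffer.Write(chunk, 0, chunk.Length);
        if (buffer.Length >= batchSize)
        {
            await stream.WriteAsync(buffer.ToArray());  // ReadOnlyMemory<byte> overload
            buffer.SetLength(0);
        }
    }
    if (buffer.Length > 0)
    {
        await stream.WriteAsync(buffer.ToArray());
    }
}
```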
IMHO, this is not particularly interesting. When you run things for long enough at max CPU, queues build up. You cannot queue unlimited work. You will continue to hit issues like this. Dropped packets are not fatal. They will get retransmitted. |
Point taken. We have quite a few randomly failing tests in CI - something not trivial to investigate. |
I put an assert in QuicWorkerIsOverloaded and I can see it hit; it took ~200 iterations in this case.
|
The impact on our tests is getting larger and larger, affecting the entire runtime repo. We need to either find a way to disable the behavior in msquic during our tests, or we have to find a workaround to not hit it in our tests (e.g. run msquic tests serially). We need some solution ASAP (in a day or two). As an example: even our BasicTest in #56269 is failing too often and we should disable it in CI due to the noise it is causing. |
We talked about it more today @nibanks, and just to confirm and clarify: when MsQuic feels it is overloaded, it would send back packets with a notification instead of just ignoring the request and waiting for a re-transmit, right? And is there any other reason why a client may get the CONNECTION_REFUSED error? |
In this case, MsQuic should inform the peer with the CONNECTION_REFUSED error. There shouldn't be any other reason that you'd see this error from an MsQuic server. You can retry if you want, but you're just going to be adding more load to the system and there's no guarantee you'll succeed any time soon. If you have enough parallel tests all "retrying until they succeed" it's possible you never get there. Perhaps if you had a "sleep and retry" model it would work better. |
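A rough sketch of that "sleep and retry" idea is below. Detecting the refusal via `QuicException.QuicError` matches the newer System.Net.Quic surface; older previews reported refusals differently, so the catch filter and method name are assumptions for illustration only:

```csharp
using System;
using System.Net.Quic;
using System.Threading.Tasks;

// Sketch of a backoff-and-retry loop around the connect; the delay grows so that
// retries do not pile even more load onto an already overloaded msquic worker.
static async Task<QuicConnection> ConnectWithRetryAsync(QuicClientConnectionOptions options, int maxAttempts = 5)
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return await QuicConnection.ConnectAsync(options);
        }
        catch (QuicException ex) when (ex.QuicError == QuicError.ConnectionRefused && attempt < maxAttempts)
        {
            await Task.Delay(TimeSpan.FromMilliseconds(100 * attempt));
        }
    }
}
```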
I see no evidence that the system would be generally overloaded. When I run it locally on my VM I see 20-80% CPU utilization. There may be spikes, and with docker and virtualization there are no real-time guarantees. At this moment this is the biggest reason for CI instability impacting all .NET contributors. There are other factors like occasional DNS failures or outages in external Azure services, but all the others are quite rare so we live with them. If we can move this to the occasional-but-rare category it would still be a big improvement. That may be wait and retry, increasing the threshold, or some other strategy. For example, could MsQuic give us an event so we know that a particular connection is going to fail? |
Triage: partial fix in PR. We also want to disable parallelism of QUIC tests to minimize the probability of this issue; after all, that's what msquic does in their test suite. |
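For reference, serializing the QUIC tests in xUnit can be expressed with a collection that disables parallelization. The class names here are placeholders, not the actual test classes in the repo:

```csharp
using Xunit;

// Placing all QUIC tests in one collection with parallelization disabled makes
// xUnit run them one at a time, similar to msquic's own test suite.
[CollectionDefinition(nameof(QuicSerialCollection), DisableParallelization = true)]
public class QuicSerialCollection { }

[Collection(nameof(QuicSerialCollection))]
public class MsQuicListenerAcceptTests
{
    // ... QUIC tests that should not run concurrently ...
}
```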
Triage: there's nothing we can do about the root cause on our side. We could expose counters with the values of msquic statistics (either private or public). Either way, this is diagnostics and not critical. |
Now that #98361 is merged, this issue should not be an issue anymore (unless the machine is truly overloaded). Let's keep this issue open until we can validate this when MsQuic 2.4 gets released and the CI images get updated. |
Are we brave enough to run tests in parallel again? |
I would try it once all runners have MsQuic 2.4 (the cert validation does not run async on lower versions because of microsoft/msquic#4132). |
Should we take that fix to v2.3? Or do you not care, because you're not actually using v2.3? |
We are running 2.3 on CI, so porting it would be nice. Thanks in advance. |
Ok. I will try to get around to it this week. If you want, feel free to make a PR to |