packet loss and solutions #1012
More information: the "hold NIC buffers" behavior of the Windows protocol stack is not documented on Microsoft's website. The following is based on my analysis with Windbg.
I think this is a special optimization that Microsoft has implemented for the packet-receiving path: copying directly from the network card driver's buffers to the user-mode buffer provided by WSARecv, bypassing any intermediate allocations and copies. However, when the number of RX buffers pre-allocated by the network card driver is fixed, this optimization can lead to packet loss in some cases.

In afd.sys!AfdReadRegistry (called by afd!DriverEntry), the logic is as follows (according to Windbg/IDA with public symbols):

```c
....
memset(&VersionInformation, 0, sizeof(VersionInformation));
VersionInformation.dwOSVersionInfoSize = 284;
v3 = RtlGetVersion((PRTL_OSVERSIONINFOW)&VersionInformation) < 0 || VersionInformation.wProductType == 1;
AfdDoNotHoldNICBuffers = v3;
....
```

```c
typedef enum
{
    VER_NT_WORKSTATION       = 0x00000001,
    VER_NT_DOMAIN_CONTROLLER = 0x00000002,
    VER_NT_SERVER            = 0x00000003
} OS_TYPE;
```

If VersionInformation.wProductType == 1 (VER_NT_WORKSTATION), then AfdDoNotHoldNICBuffers will be 1, meaning the "hold NIC buffers" behavior is disabled by default on client OS editions. For VER_NT_SERVER or VER_NT_DOMAIN_CONTROLLER its value will be 0, unless a value of 1 is explicitly specified in the registry.
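For anyone who wants to see which default applies on a given machine, here is a small user-mode sketch (not part of afd.sys; it only reports the product type that, per the analysis above, drives the default):

```cpp
// Reports whether the OS is a server SKU. Per the decompiled logic above, client
// SKUs (wProductType == VER_NT_WORKSTATION) get AfdDoNotHoldNICBuffers = 1 by
// default, i.e. the "hold NIC buffers" optimization is disabled.
#include <windows.h>
#include <VersionHelpers.h>
#include <cstdio>

int main()
{
    if (IsWindowsServer())
        std::printf("Server SKU: AFD may hold NIC buffers by default\n");
    else
        std::printf("Client SKU: AfdDoNotHoldNICBuffers defaults to 1\n");
    return 0;
}
```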
When using NdisMAllocateSharedMemoryAsyncEx to allocate memory, the NDIS kernel thread for asynchronous memory allocation consumes a significant amount of CPU in the worst case: it seems to be searching for contiguous physical pages (even though physical memory usage was actually quite low, with 2 GB still free). At first glance, the CPU consumed by memory allocation appears to be related to Windows' large-page memory management. Further investigation is required, but progress will be slow without access to Windows source code or private symbols.
Regarding the NdisMAllocateSharedMemoryAsyncEx issue, I have conducted the following test so far: allocating memory page by page in 4K PAGE_SIZE units, intending to avoid the Windows kernel's search for contiguous physical pages. However, when large pages are enabled, allocating 4K pages is still inefficient and can be even slower.
@80886 First of all, thank you for the analysis; this is a good starting point, and the problem you raise is real and relevant.
I'll provide some information that I can recall for now, and some details will need to be confirmed next week.
@80886 I do not suggest using vhost=off in a production environment. I think that if we're looking for a solution in the driver, the solution should be the same regardless of how exactly the problem is triggered. I'm also looking for a suitable repro scenario with vhost; for now I'm able to reproduce it with iperf3, I see the losses (~200) in the first session after boot, and I'm not sure that all the losses are reflected in TX.dropped.
Thanks very much for your guidance. Regarding the iperf3 700 (25%) / 120 (0%) packet loss: I can also reproduce it with the solution 1 commit. The packet loss in our production environment first occurred about two years ago, and at that time our temporary emergency fix was indeed also to use NDIS_RECEIVE_FLAGS_RESOURCES. However, compared to the solution 1 commit, there is a key difference: instead of using DPCContext, we call pBufferDescriptor->Queue->IsRxBufferShortage() directly within ProcessReceiveQueue(). I applied the same temporary fix as we did back then (commit here). With this commit, running the iperf3 -N test again in the same environment, the packet loss decreased from 1,713,780 to 6,109 compared to the previous solution 1 commit.

Test environment: vhostuser, 2 queues + 256 queue_size, 4 CPUs, 4 RSS queues.

Test script:

```bat
@echo off
setlocal
set /a "count=0"
:loop
if %count%==100 goto end
echo %count%
.\iperf3.exe -c $guest_ip -P 4 -N -l 10
set /a "count+=1"
goto loop
:end
endlocal
```

In my test environment, if I run iperf only once, packet loss may or may not occur regardless of which commit is used or what values the parameters are set to, which is not convenient for comparison. If I set a single session to run for a long time using the -t parameter, such as for 3 minutes, packet loss mostly occurs only in the first few seconds, with no losses afterward. That is why the script loops 100 times, with each iteration lasting 10 seconds.

Test results:

Solution 1 commit with MinRxBufferPercent set to 25%:

[root@kvm ~]# virsh domifstat 88352212-4ef9-4235-b351-334a5b9e81a2 veth_271982a2
veth_271982a2 rx_bytes 1906148869
veth_271982a2 rx_packets 19802095
veth_271982a2 rx_errs 0
veth_271982a2 rx_drop 1713780 <<<============
veth_271982a2 tx_bytes 388468430
veth_271982a2 tx_packets 7117675
veth_271982a2 tx_errs 0
veth_271982a2 tx_drop 0

New commit with MinRxBufferPercent set to 25%:

[root@kvm ~]# virsh domifstat 08730dcd-42e5-4040-b58d-cca2b87b1484 veth_54099Ddd
veth_54099Ddd rx_bytes 2246687914
veth_54099Ddd rx_packets 26851892
veth_54099Ddd rx_errs 0
veth_54099Ddd rx_drop 6109 <<<============
veth_54099Ddd tx_bytes 687744795
veth_54099Ddd tx_packets 12693977
veth_54099Ddd tx_errs 0
veth_54099Ddd tx_drop 0

New commit, MinRxBufferPercent set to 0%:

[root@kvm ~]# virsh domifstat 08730dcd-42e5-4040-b58d-cca2b87b1484 veth_54099Ddd
veth_54099Ddd rx_bytes 1340918784
veth_54099Ddd rx_packets 9009548
veth_54099Ddd rx_errs 0
veth_54099Ddd rx_drop 1385165 <<<========
veth_54099Ddd tx_bytes 227663224
veth_54099Ddd tx_packets 4130477
veth_54099Ddd tx_errs 0
veth_54099Ddd tx_drop 0

New commit, MinRxBufferPercent set to 0%, AFD registry param DoNotHoldNicBuffers set to 1:

[root@kvm ~]# virsh domifstat 88352212-4ef9-4235-b351-334a5b9e81a2 veth_271982a2
veth_271982a2 rx_bytes 2043782245
veth_271982a2 rx_packets 22922345
veth_271982a2 rx_errs 0
veth_271982a2 rx_drop 449324 <<<========
veth_271982a2 tx_bytes 441322380
veth_271982a2 tx_packets 8094288
veth_271982a2 tx_errs 0
veth_271982a2 tx_drop 0
(I conducted this test four times; on one occasion there was zero packet loss, while the other three instances each had around 500,000 losses. It appears that setting the registry param to 1 has some effect, but not as much as expected.)
Data discrepancies can occur from test to test, likely influenced by the behavior of the protocol stack or TCP congestion control. However, the general trend is stable: in the tests mentioned above, the new commit with the parameter set to 25% consistently shows improved results. The new commit probably still needs modification, because it calls IsRxBufferShortage() for every pBufferDescriptor in ProcessReceiveQueue(), although the overhead should not be too significant in theory. Regarding the suggestions for RSS hits/misses: I will test them in the next couple of days. Regarding HCK:
@80886 Do you have some news that you want to share?
I added some code to count rxMinFreeBuffers, rxIndicatesWithResourcesFlag, and totalRxIndicates (commit link), and then conducted some tests (typical TCP case). Test env: 16 CPUs (Intel 8255C), 8 queues, 16 RSS, max bandwidth 10 Gbit/s. Command line:
run 3 times and take the average value:
From the table above:
Therefore, I think that for a 10 Gb typical TCP case, a reasonable queue_size configuration is better, and we should not rely on MinRxBufferPercent too much. MinRxBufferPercent is mainly intended for TCP_NODELAY / AfdHoldNicBuffers scenarios, and a default value of 50% appears to be acceptable.
@80886 Thank you for sharing the results.
The reason I didn't use ntttcp -ndl is that in this test my purpose is to observe the impact on CPU usage and bandwidth in a typical TCP case (TCP aggregation) when the parameter is set to 25%, 50%, 75%, and 100% respectively. In TCP_NODELAY scenarios, although there are a large number of packets, the packet size is smaller and bandwidth usage is lower, so there I am more concerned with the mitigation effect on packet loss (the previous iperf -N test).

As for -rb 64k -sb 64k: Windows uses a dynamic socket buffer size by default, and years ago I encountered performance (CPU/BW) uncertainties introduced by dynamic buffer adjustments on some earlier versions of Windows, which often required more runs to obtain stable results (newer versions of Windows may not have this issue, but I haven't tested it). So when doing TCP performance testing, I habitually set a fixed socket buffer size to reduce uncontrollable variables, making it easier to get relatively consistent results with fewer runs (a minimal sketch follows at the end of this comment).

I agree with your suggestions regarding the CPU/RSS/queues configuration. As for 125% (auto): based on the iperf -N and ntttcp tests, I personally think 50% for a 256 queue size and 25% for a 1024 queue size look good. According to the test results, 75% for a 256 queue size and 50% for a 1024 queue size would also be acceptable, but since this is a default value that will be used by all users, I think it is more appropriate to set a smaller value: the tests above were conducted in a 10 Gb environment, and with higher bandwidth, such as 20 Gb or 40 Gb, a higher default value might introduce more memory copying inside the protocol stack, leading to more CPU usage or an impact on bandwidth. If users encounter extreme packet loss situations, they can configure the parameter themselves. I plan to add a document explaining the MinRxBufferPercent parameter, so that users can understand how to adjust it according to their actual circumstances.
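A minimal Winsock sketch of fixing the socket buffer sizes (the equivalent of ntttcp's -rb 64k -sb 64k); this is only an illustration of the setsockopt calls, not code from any of the test tools used above:

```cpp
// Pins SO_RCVBUF/SO_SNDBUF to 64 KB so Windows' dynamic socket buffer tuning
// does not vary between benchmark runs.
#include <winsock2.h>
#include <cstdio>
#pragma comment(lib, "ws2_32.lib")

int main()
{
    WSADATA wsa;
    if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0)
        return 1;
    SOCKET s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    int bufBytes = 64 * 1024;  // fixed 64 KB, matching -rb 64k -sb 64k
    setsockopt(s, SOL_SOCKET, SO_RCVBUF, (const char*)&bufBytes, sizeof(bufBytes));
    setsockopt(s, SOL_SOCKET, SO_SNDBUF, (const char*)&bufBytes, sizeof(bufBytes));
    std::printf("socket buffers fixed at %d bytes\n", bufBytes);
    closesocket(s);
    WSACleanup();
    return 0;
}
```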
@80886 Thinking again about the default settings: I see that if we take the out-of-the-box CPU/throughput ratio as the KPI (and we probably do not have too much choice here), we need to stay with 0% as the default. When responsiveness is important, 25% can be recommended.
I agree, and this is exactly what I originally thought.
Yes.
Okay, I will revert the changes to the WMI structure and then submit a PR within the next week. By the way, I saw your comment about "unlikely" on the new commit; I think that if we set 0% (or even 25%) as the default, there is no need to swap the code blocks within the if/else, right?
I'm a bit confused, and to be honest, I haven't delved deeply into branch prediction, the x86 architecture, or MSVC; my understanding is limited to some basic concepts, and I often rely on GCC's likely and unlikely hints when coding. I would appreciate it if I could ask a few questions. In my commit, the C code and the resulting compiled code are as follows (latest EWDK 22H2):

```cpp
if (!isRxBufferShortage)
{
NdisMIndicateReceiveNetBufferLists(pContext->MiniportHandle,
indicate, 0, nIndicate,
NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL);
}
else
{
/* If the number of available RX buffers is insufficient, we set the
NDIS_RECEIVE_FLAGS_RESOURCES flag so that the RX buffers can be immediately
reclaimed once the call to NdisMIndicateReceiveNetBufferLists returns. */
NdisMIndicateReceiveNetBufferLists(pContext->MiniportHandle,
indicate, 0, nIndicate,
NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL | NDIS_RECEIVE_FLAGS_RESOURCES);
NdisInterlockedAddLargeStatistic(&pContext->extraStatistics.rxIndicatesWithResourcesFlag, nIndicate);
ParaNdis_ReuseRxNBLs(indicate);
pContext->m_RxStateMachine.UnregisterOutstandingItems(nIndicate);
}
```

```asm
.text:0000000140005742 mov rcx, [r15+8] ; MiniportAdapterHandle
.text:0000000140005746 xor r8d, r8d ; PortNumber
.text:0000000140005749 mov r9d, ebx ; NumberOfNetBufferLists
.text:000000014000574C mov rdx, rsi ; NetBufferList
.text:000000014000574F cmp byte ptr [rbp+var_isRxBufferShortage], r8b
.text:0000000140005753 jnz short loc_140005767
.text:0000000140005753
.text:0000000140005755 mov [rsp+48h+ReceiveFlags], 1 ; NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL <<------------
.text:000000014000575D call NdisMIndicateReceiveNetBufferLists_0
.text:000000014000575D
.text:0000000140005762 jmp loc_14000582D
.text:0000000140005762
.text:0000000140005767 ; ---------------------------------------------------------------------------
.text:0000000140005767
.text:0000000140005767 loc_140005767: ; CODE XREF: RxDPCWorkBody+157↑j
.text:0000000140005767 mov [rsp+48h+ReceiveFlags], 3 ; NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL | NDIS_RECEIVE_FLAGS_RESOURCES
.text:000000014000576F call NdisMIndicateReceiveNetBufferLists_0
```

After swapping the if/else:

```cpp
if (isRxBufferShortage)
{
/* If the number of available RX buffers is insufficient, we set the
NDIS_RECEIVE_FLAGS_RESOURCES flag so that the RX buffers can be immediately
reclaimed once the call to NdisMIndicateReceiveNetBufferLists returns. */
NdisMIndicateReceiveNetBufferLists(pContext->MiniportHandle,
indicate, 0, nIndicate,
NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL | NDIS_RECEIVE_FLAGS_RESOURCES);
NdisInterlockedAddLargeStatistic(&pContext->extraStatistics.rxIndicatesWithResourcesFlag, nIndicate);
ParaNdis_ReuseRxNBLs(indicate);
pContext->m_RxStateMachine.UnregisterOutstandingItems(nIndicate);
}
else
{
NdisMIndicateReceiveNetBufferLists(pContext->MiniportHandle,
indicate, 0, nIndicate,
NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL);
}
```

```asm
.text:0000000140005742 mov rcx, [r15+8] ; MiniportAdapterHandle
.text:0000000140005746 xor r8d, r8d ; PortNumber
.text:0000000140005749 mov r9d, ebx ; NumberOfNetBufferLists
.text:000000014000574C mov rdx, rsi ; NetBufferList
.text:000000014000574F cmp byte ptr [rbp+var_isRxBufferShortage], r8b
.text:0000000140005753 jz loc_140005821
.text:0000000140005753
.text:0000000140005759 mov [rsp+48h+ReceiveFlags], 3 ; NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL | NDIS_RECEIVE_FLAGS_RESOURCES <<----------
.text:0000000140005761 call NdisMIndicateReceiveNetBufferLists_0
.........
.........
.........
.text:0000000140005821 loc_140005821: ; CODE XREF: RxDPCWorkBody+157↑j
.text:0000000140005821 mov [rsp+48h+ReceiveFlags], 1 ; NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL <<----------
.text:0000000140005829 call NdisMIndicateReceiveNetBufferLists_0
```

Comparing the two sets of code, based on my limited understanding, I think the code before the swap is more friendly to x86 static branch prediction. Or are there things I don't know? For example, within the logic of static branch prediction, do the taken/not-taken semantics differ between j* and jn* instructions?
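For reference, a minimal, generic illustration of the source-level hints mentioned above (GCC's __builtin_expect and the C++20 [[likely]]/[[unlikely]] attributes, which recent MSVC also accepts under /std:c++20). This is not code from the driver, and the actual effect on code layout is compiler-dependent:

```cpp
// Annotating the rare path instead of reordering the blocks. The predicate and
// the two handlers are placeholders, not the driver's functions.
#include <cstdio>

static bool isShortage(int freeBuffers) { return freeBuffers < 64; }

static void Indicate(int freeBuffers)
{
    if (isShortage(freeBuffers)) [[unlikely]]  // hint: shortage is the rare case
        std::puts("indicate with NDIS_RECEIVE_FLAGS_RESOURCES");
    else
        std::puts("indicate normally");
}

int main()
{
    Indicate(200);
    Indicate(10);
}
```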
@80886 You're right, I need to recheck the behavior; what I said was based on my checks from years ago.
Implemented a workaround for rx packet loss encountered with TCP_NODELAY. By default, this feature is turned off to minimize the impact on CPU performance and network bandwidth. To activate the feature, go to the network card properties and set the MinRxBufferPercent to a value above 0. A good starting point is to set MinRxBufferPercent to 25% and then monitor for any packet loss. If packet loss continues, incrementally adjust the value to 50%, then 75%, and if necessary, up to the maximum of 100%. In extreme cases, it may be required to not only adjust the MinRxBufferPercent but also to increase the number of queues and the size of each queue (queue_size). It is important to note that this feature is only applicable to rx packet loss. For more details, see [this thread](virtio-win#1012). Signed-off-by: junjiehua <[email protected]>
Under various conditions, when the host-to-device packet rate is high, we lose packets in QEMU due to a lack of guest-allocated buffers. Look also at virtio-win/kvm-guest-drivers-windows#1012 Signed-off-by: wji <[email protected]>
Describe
When a TCP-based application enables the TCP_NODELAY option by
setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
(which disables TCP Nagle), and the TCP server does not receive packets as efficiently as possible, the operating system protocol stack's special receive-path behavior can temporarily lead to exhaustion of the RX descriptors, resulting in packet loss.

To Reproduce
Create two identical VMs:
On VM1:
python.exe ./server.py 8000
On VM2:
python.exe ./client.py 256 $VM1_IP 8000
ping $VM1_IP -t
(server.py and client.py can be found in the Additional context section.)
Expected behavior
When pinging VM1 from VM2, severe packet loss should not occur.
VM:
Additional context
server.py:
client.py:
A brief description of the code above:
server.py: starts a simple TCP server, accepts TCP connections, and receives data. There is a special line:
The purpose of this line is to simulate a poorly designed, slow-running TCP server. (In real-world scenarios, such a "sleep" might be caused by slow disk IO, slow external HTTP requests, slow SQL queries, or CPU contention.)
client.py: creates a specified number of TCP connections and then continuously sends data to the server. There is also a special line:
Its purpose is to disable TCP's Nagle algorithm; without this line, the packet loss issue may not be easily reproduced.
The packet loss issue mostly occurs in TCP-based, latency-sensitive game servers (this type of application usually sets TCP_NODELAY). I have conducted some analysis of this packet loss issue; the root cause and solutions are below.

Root cause:
After the netkvm driver passes packets to the protocol stack using NdisMIndicateReceiveNetBufferLists, the protocol stack, under normal conditions, is expected to quickly return the buffers to the netkvm driver via NdisReturnNetBufferLists. The netkvm driver then re-queues these buffers into the virtio queue.
However, the Windows protocol stack incorporates a particular mechanism: if the application doesn't retrieve packets quickly, the protocol stack won't return the buffers immediately but will hold them temporarily (for up to 1 second). During such intervals, if lots of packets arrive, the virtio queue may run short of free buffers, potentially resulting in packet loss.
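To make the numbers concrete, here is a throwaway back-of-the-envelope calculation (not driver code; the packet rate is an assumed example, not a measured value):

```cpp
// If the protocol stack holds every indicated buffer (worst case ~1 s) and none
// are returned, a queue of N buffers is exhausted after N further packets.
#include <cstdio>

int main()
{
    const double packetsPerSecond = 100000.0;  // assumed burst rate, for illustration
    const unsigned queueSize = 256;            // a typical rx queue_size from the tests
    std::printf("queue exhausted after ~%.2f ms if no buffers are returned\n",
                1000.0 * queueSize / packetsPerSecond);  // ~2.56 ms
}
```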
I have implemented two solutions:

Solution 1:
In processRxRing(), check whether the number of available buffers in the virtio queue is below a certain percentage (e.g., 25%). If it is, pass NDIS_RECEIVE_FLAGS_RESOURCES when calling NdisMIndicateReceiveNetBufferLists. This flag tells the Windows protocol stack not to hold the buffers, but to copy the packets and return the buffers to the netkvm driver immediately. Provide a configurable parameter: MinRxBufferPercent.
Commit link
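As a rough illustration of the threshold test behind MinRxBufferPercent (a standalone sketch, not the actual commit; the counter names are assumptions based on the description above):

```cpp
// Shortage when the free RX buffers drop below MinRxBufferPercent% of the
// queue's capacity; the real commit wires an equivalent check into
// processRxRing() / IsRxBufferShortage().
#include <cstdio>

static bool IsRxBufferShortage(unsigned freeBuffers, unsigned totalCapacity,
                               unsigned minRxBufferPercent)
{
    return freeBuffers * 100u < totalCapacity * minRxBufferPercent;
}

int main()
{
    // Example: 256-entry queue with a 25% threshold -> shortage below 64 free buffers.
    std::printf("%d\n", IsRxBufferShortage(60, 256, 25));  // prints 1
    std::printf("%d\n", IsRxBufferShortage(80, 256, 25));  // prints 0
}
```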
Solution 2:
In processRxRing(), check whether the number of available buffers in the virtio queue is below a certain percentage (e.g., 25%). If it is, allocate more RX buffers using NdisMAllocateSharedMemoryAsyncEx. Provide two configurable parameters:
a. MinRxBufferPercent: almost the same as the MinRxBufferPercent parameter in solution 1. It indicates that when the number of available buffers in the RX queue falls below init.rxCapacity * MinRxBufferPercent%, additional memory will be allocated via NdisMAllocateSharedMemoryAsyncEx.
b. MaxInflightAllocMultiplier: this parameter limits the maximum number of in-progress asynchronous memory allocations; without this limit, excessive in-flight allocations could result in high memory or CPU consumption (as discussed later).
Commit link
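And a minimal sketch of how MaxInflightAllocMultiplier could gate new asynchronous allocations (standalone and illustrative only; the exact formula and field names are assumptions, and in the driver the allocation itself would be the NdisMAllocateSharedMemoryAsyncEx call mentioned above):

```cpp
// Only start another asynchronous RX buffer allocation while the number of
// in-flight requests stays under a cap derived from MaxInflightAllocMultiplier,
// so persistent shortages cannot queue up unbounded allocation work.
#include <cstdio>

struct RxAllocState
{
    unsigned inflightAllocs;             // async allocations not yet completed
    unsigned maxInflightAllocMultiplier; // configurable parameter (assumed semantics)
    unsigned rxQueueCount;               // number of RX queues
};

static bool MayStartAsyncAllocation(const RxAllocState& s)
{
    return s.inflightAllocs < s.rxQueueCount * s.maxInflightAllocMultiplier;
}

int main()
{
    RxAllocState s{7, 2, 4};  // 7 in flight, multiplier 2, 4 queues -> cap of 8
    std::printf("%d\n", MayStartAsyncAllocation(s));  // prints 1
}
```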
Both solutions can effectively fix the packet loss issue (or at least mitigate it to a large extent), but each has its own set of pros and cons. I haven't decided which solution to submit a pull request for and would like to hear everyone's opinions before making a decision.
Solution 1:
The downside of using NDIS_RECEIVE_FLAGS_RESOURCES to send packets to the protocol stack is that it causes extra work inside the protocol stack, like allocating and copying memory. From what I've seen in my tests, this extra work can lead to around 5% CPU usage (it doesn't continuously use 5% CPU, only when the application is running slowly). We've tried this solution on our cloud service for a few customers who were having packet loss trouble, and it has gone over well.

Solution 2:
This solution avoids the additional allocation and copying inside the protocol stack. But there is also a downside: sometimes NdisMAllocateSharedMemoryAsyncEx is quite slow at allocating memory. In the worst case, it may lead to the NDIS kernel thread consuming an entire core on an Intel 8255C CPU.
Furthermore, this solution involves memory management, and the user needs to set the parameters carefully; otherwise, it could lead to memory exhaustion or high CPU usage by the NDIS kernel thread. In this respect, Solution 1 is simpler, because netkvm does not directly allocate additional memory. Instead, the NDIS protocol stack is responsible for its own allocation and copying.
There is another solution, which has not been implemented:
During netkvm initialization, allocate additional RxDescriptors based on the Init.RxCapacity parameter (or add a new parameter that can be configured by the user), and then use these extra buffers to fill the virtio queue during processRxRing(). This should work too, but it has not been tested.
Additionally, when RSC is defined, each buffer is approximately 64 KB in size (MAX_HW_RX_PACKET_SIZE), so simply pre-allocating more RxDescriptors during driver init could consume an excessive amount of memory (for example, with 16 queues and a queue_size of 1024, that is 16 * 1024 * 64K = 1024M). Implementing this solution would require more consideration.
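A quick check of the memory figure above (a throwaway calculation, not driver code):

```cpp
// With RSC enabled each RX buffer is ~64 KB, so pre-allocating 16 queues x 1024
// descriptors costs 16 * 1024 * 64 KB = 1024 MiB.
#include <cstdio>

int main()
{
    const unsigned long long queues = 16, queueSize = 1024, bufferBytes = 64ull * 1024;
    std::printf("%llu MiB\n", queues * queueSize * bufferBytes / (1024ull * 1024ull));  // 1024
}
```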
I personally prefer solution 1: it's simple to implement, simple to use, and, according to my testing, many physical network card drivers, including the XEN PV driver, also use NDIS_RECEIVE_FLAGS_RESOURCES to mitigate packet loss issues.
Both solutions require one or more configurable parameters, and it is difficult to determine a default that would be suitable for all cases, as packet loss is related to the number of connections, packet rate, number of queues, and queue_size. Therefore, my idea is to provide this feature but not configure a default value (keeping it disabled by default), leaving the decision to the user.
The current implementations of Solutions 1 and 2 are just for testing and discussion, not the final version (especially Solution 2). I would like to discuss with everyone which solution is more appropriate. After we reach a final decision, I will continue to refine it before submitting a pull request.
Thanks!