Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

packet loss and solutions #1012

Closed
80886 opened this issue Dec 15, 2023 · 19 comments
Closed

packet loss and solutions #1012

80886 opened this issue Dec 15, 2023 · 19 comments

Comments

@80886
Copy link

80886 commented Dec 15, 2023

Describe

When a TCP-based application enables the TCP_NODELAY option by setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1) (disables TCP Nagle), if the TCP Server does not receive packets as efficiently as possible, the operating system's protocol stack's special behavior when receiving packets can temporarily lead to the exhaustion of the RX Descriptors, and resulting in packet loss.

To Reproduce

Create two same VMs:

  • 4 cores
  • 8GB RAM
  • A virtio network card (queues:2, queue_size:1024)
  • Install Windows Server 2019 Datacenter and apply the latest update

On VM1:

  1. Run python.exe ./server.py 8000.

On VM2:

  1. First, run python.exe ./client.py 256 $VM1_IP 8000.
  2. Then, in another command prompt window, run ping $VM1_IP -t.

(server.py and client.py can be found in the Additional context section.)

Expected behavior

When pinging VM1 from VM2, severe packet loss should not occur.

VM:

  • Windows Server 2019 Datacenter(with update Nov,2023)
  • Can be reproduced with both the latest master build and the latest stable version(virtio-win-0.1.240.iso) of the Netkvm driver.

Additional context

server.py:

import asyncio
import time
import random
import os
import sys

async def handle_echo(reader, writer):
    try:
        while True:
            data = await reader.read(1024)
            if not data:
                break
            await asyncio.sleep(0.2)
    except Exception as e:
        print(f'Error in handle_echo: {str(e)}')
    finally:
        if not writer.is_closing():
                writer.close()
                await writer.wait_closed()

async def main():
    server = await asyncio.start_server(
        handle_echo, '0.0.0.0', int(sys.argv[1]))

    addr = server.sockets[0].getsockname()
    print(f'Serving on {addr}')

    async with server:
        try:
            await server.serve_forever()
        except Exception as e:
            print(f'Error in server: {e}')

asyncio.run(main())

client.py:

import sys
import socket
import asyncio
import os
import psutil

def is_vms_greater_than_1G():
    pid = os.getpid()

    proc = psutil.Process(pid)

    mem_info = proc.memory_info()

    return mem_info.vms > (1024 * 1024 * 1024)

msg = b"test"

async def tcp_echo_client(id, host, port):
    loop = asyncio.get_running_loop()

    while True:
        try:
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
            sock.connect((host, port))

            while True:
                if is_vms_greater_than_1G(): #avoid OOM
                    await asyncio.sleep(0.01)
                await loop.sock_sendall(sock, msg)    

        except Exception as e:
            print(f'Exception:{ str(e) }. Reconnecting...')
            sock.close()
            await asyncio.sleep(0.01)

async def main(n, host, port):
    tasks = [tcp_echo_client(i, host, port) for i in range(n)]
    await asyncio.gather(*tasks)

n = int(sys.argv[1]) if len(sys.argv) > 1 else 1
host = sys.argv[2] if len(sys.argv) > 2 else 'localhost'
port = int(sys.argv[3]) if len(sys.argv) > 3 else 8000

asyncio.run(main(n, host, port))

A brief description of the code above:

server.py:

starts a simple TCP Server, accepts TCP connections and receives data. There is a special line:

await asyncio.sleep(0.2)

The purpose of this line is to simulate a poorly designed, slow-running TCP Server. (In real-world scenarios, such "sleep" might be due to slow disk IO, slow external HTTP requests, slow SQL queries, or CPU contention.)

client.py:

creates a specified number of TCP connections and then continuously sends data to the Server. There is also a special line:

sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

The purpose is to disable TCP Nagle's algorithm, without this line, the packet loss issue may not be easily reproduced.


The packet loss issue mostly occurs in TCP-based, latency-sensitive gaming server (this type of application usually sets TCP_NODELAY). I have conducted some analysis on this packet loss issue, and the following are the root cause and solutions.

Root cause:

After the netkvm driver passes packets to the protocol stack using NdisMIndicateReceiveNetBufferLists, the protocol stack, under normal conditions, is expected to quickly return the buffers to the netkvm driver via NDISReturnNetBufferLists. The netkvm driver then re-queues these buffers into the virtio queue.

However, the Windows protocol stack incorporates a particular mechanism: if the application doesn't retrieve packets quickly, the protocol stack won't return the buffers immediately but will hold them temporarily (for up to 1 second). During such intervals, if lots of packets arrives, the virtio queue may run short on free buffers, potentially resulting in packet loss.


I had implemented two solutions:

Solution 1:

  1. In processRxRing(), Check if the available buffers in virtio queue is below a certain percentage (e.g., 25%)

  2. If it is, then pass NDIS_RECEIVE_FLAGS_RESOURCES when calling NdisMIndicateReceiveNetBufferLists. This flag informs the Windows protocol stack not to hold the buffers but to copy the packets and return the buffer to netkvm driver immediately;

  3. Provide a configurable parameter: MinRxBufferPercent.

    • If set to 0, it means this feature is disabled;
    • If set to 100, it means NDIS_RECEIVE_FLAGS_RESOURCES will be indicated for every packet. (This is not commonly used and is intended for extreme cases of severe packet loss.)
    • If set to other value, for example 50, it means that if the available buffer count in the rx queue is less than 50% of init.rxCapacity, then NDIS_RECEIVE_FLAGS_RESOURCES will be indicated.
  4. Commit link

Solution 2:

  1. In processRxRing(), Check if the available buffers in virtio queue is below a certain percentage (e.g., 25%)

  2. If it is, then allocate more rx buffers using NdisMAllocateSharedMemoryAsyncEx.

  3. Provide two configurable parameters:

    a. MinRxBufferPercent: Almost same as MinRxBufferPercent parameter in solution 1.
    This parameter indicates that when the number of available buffers in the RX queue falls below init.rxCapacity * MinRxBufferPercent%, additional memory will be allocated via NdisMAllocateSharedMemoryAsyncEx.

    b. MaxInflightAllocMultiplier: This parameter limits the maximum number of in-progress asynchronous memory allocations, without this limit, could result in high memory or CPU consumption (as will be discussed later).

  4. Commit link


Both solutions can effectively fix the packet loss issue (or at least mitigate it to a large extent), but each solution has its own set of pros and cons, I haven't decided which solution to submit a pull request for and would like to hear everyone's opinions before making a decision.

Solution 1:

The downside of using NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL NDIS_RECEIVE_FLAGS_RESOURCES to send packets to the protocol stack is that it causes some extra work inside the protocol stack, like allocating and copying memory. From what I've seen in my tests, this extra work can lead to using up around 5% CPU usage (It doesn't continuously using 5% CPU, only when the application is running slowly). We've tried this solution on our cloud service for a few customers who were having packet loss trouble, and it's gone over well.

Solution 2:

This solution avoid the additional allocating and copying inside the protocol stack. But there is also a downside: Sometimes NdisMAllocateSharedMemoryAsyncEx is pretty slow at allocating memory. In the worst-case, it may lead to the NDIS kernel thread consuming an entire core on an Intel 8255C CPU.

Furthermore, this solution involves memory management, and the user needs to carefully set parameters, otherwise, it could lead to memory exhaustion or high CPU usage by the NDIS kernel thread. In this respect, Solution 1 is simpler, because netkvm does not directly allocate additional memory. Instead, the NDIS protocol stack is responsible for its own allocation and copying.


There is another solution, which has not been implemented:

During netkvm initialization, allocate additional RxDescriptors based on the Init.RxCapacity parameter (or add new parameter which can be configured by the user), and then use these extra Buffers to fill the virtio queue during processRxRing(). This should be work too, but it has not been tested.

Additionally, when RSC is defined, each buffer is approximately 64KB in size (MAX_HW_RX_PACKET_SIZE), simply pre-allocating more RxDescriptors during driver init could potentially consume an excessive amount of memory(such as with 16 queue 1024 queue_size, it's 16*1024*64K=1024M). Implementing this solution would require more consideration.


I personally prefer to use solution 1, it's simple to implementation, simple to use, and according to my testing, many physical network card drivers, including the XEN PV Driver, also use NDIS_RECEIVE_FLAGS_RESOURCES to mitigate packet loss issues.

Both solutions require one or more configurable parameters, and it is difficult to determine a default parameter that would be suitable for all cases, as packet loss is related to the number of connections/packet quantity/number of queues/queue_size. Therefore, my idea is to provide this feature, but not configure a default value (keeping the function disabled by default), leaving it up to the user to decide.

Current implementations of Solutions 1 and 2 are just for testing and discussion, not the final version(especially Solution 2). I would like to discuss with everyone which solution is more appropriate. After we reach a final decision, I will continue to refine them before submitting a pull request.

Thanks!

@80886
Copy link
Author

80886 commented Dec 15, 2023

More information:

The Hold Nic Buffers behavior in the Windows protocol stack is not documented on Microsoft's website. However, based on the results of my analysis using Windbg:

  1. There is an undocumented registry parameter that can control the behavior of the protocol stack: HKLM\\SYSTEM\\CurrentControlSet\\Services\\AFD\\Parameters\\DoNotHoldNicBuffers: REG_DWORD
  2. This behavior is enabled by default on Windows Server OS and disabled by default on Windows Client OS.

By executing reg add "HKLM\SYSTEM\CurrentControlSet\Services\AFD\Parameters" /v "DoNotHoldNicBuffers" /d 1 /t REG_DWORD /f and then restarting the VM, the packet loss issue can also be mitigated effectively. However, since this parameter is not officially documented, it cannot be guaranteed to be supported across all Windows versions.

@ybendito
Copy link
Collaborator

ybendito commented Dec 15, 2023

  1. This behavior is enabled by default on Windows Server OS and disabled by default on Windows Client OS.
    @80886 NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL => NDIS_RECEIVE_FLAGS_RESOURCES
    Very interesting
    @YanVugenfirer

@80886
Copy link
Author

80886 commented Dec 15, 2023

  1. This behavior is enabled by default on Windows Server OS and disabled by default on Windows Client OS.
    @80886 NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL => NDIS_RECEIVE_FLAGS_RESOURCES
    Very interesting
    @YanVugenfirer

I think this is a special optimization that Microsoft has implemented for the packet receiving process: copying directly from the network card driver's buffer to the user-mode buffer provided by wsarecv, bypassing any intermediate allocations and copies. However, when the number of pre-allocated RX buffers by the network card driver is fixed, this special optimization can lead to packet loss in some cases.

In afd.sys!AfdReadRegistry(called by afd!DriverEntry), the logic is:

(according to windbg/ida with public symbols)

  ....
   memset(&VersionInformation, 0, sizeof(VersionInformation));
    VersionInformation.dwOSVersionInfoSize = 284;
    v3 = RtlGetVersion((PRTL_OSVERSIONINFOW)&VersionInformation) < 0 || VersionInformation.wProductType == 1;
    AfdDoNotHoldNICBuffers = v3;
   ....
 typedef  enum 
 {
   VER_NT_WORKSTATION = 0x00000001,
   VER_NT_DOMAIN_CONTROLLER = 0x00000002,
   VER_NT_SERVER = 0x00000003
 } OS_TYPE;

if VersionInformation.wProductType == 1(VER_NT_WORKSTATION ), then AfdDoNotHoldNICBuffers will be 1, means "hold nic buffers" behavior get disabled for client OS by default.

For VER_NT_SERVER or VER_NT_DOMAIN_CONTROLLER, it's value will be 0, unless value 1 is explicitly specified in the registry.

@80886
Copy link
Author

80886 commented Dec 15, 2023

When using NdisMAllocateSharedMemoryAsyncEx to allocate memory, the NDIS kernel thread for asynchronous memory allocation consumes a significant amount of CPU(worst-case):

ndishighcpu

It seems to be searching for contiguous physical pages (even though physical memory usage was actually quite low, with 2GB still free).

The issue of memory allocation consuming CPU resources appears to be related to Windows' large page memory management at first glance. Further investigation is required, but progress will be slow without the right to access Windows source code or private symbols.

@80886
Copy link
Author

80886 commented Dec 15, 2023

When using NdisMAllocateSharedMemoryAsyncEx to allocate memory, the NDIS kernel thread for asynchronous memory allocation consumes a significant amount of CPU(worst-case):

ndishighcpu

It seems to be searching for contiguous physical pages (even though physical memory usage was actually quite low, with 2GB still free).

The issue of memory allocation consuming CPU resources appears to be related to Windows' large page memory management at first glance. Further investigation is required, but progress will be slow without the right to access Windows source code or private symbols.

Regarding the issue of NdisMAllocateSharedMemoryAsyncEx, I have conducted the following test so far:

Allocating memory page by page according to the 4K PAGE_SIZE, intended to avoid the Windows kernel's search for contiguous physical pages. However, when large pages are enabled, allocating 4K pages is still inefficient and can be even slower.

@ybendito
Copy link
Collaborator

@80886 First of all, thank you for the analysis, this is a good point to start from, the problem that you raise is real and actual.
Several questions to understand your environment:

  • Do you use QEMU/QEMU fork/Some different virtualization?
  • How many hardware queues and guest CPUs you have in the production environment? (I understand that you're able to repro the problem also in minimal environment like 1 queue and 1 client, but the question is about production environment),
  • If you use QEMU or close fork: a) what is qemu base version? b) do you use virtio-net with vhost=on or off?
  • do you suppress RSC?
  • do you use RedHat driver or Fedora driver or your own brand?
    Now regarding solution 1 (I also think it is more practical):
    I tried it with little different setup: 4 queues, 4 cpu, vhost = off, rss = on
    Such setting should be ideal for your code (when num of vcpus = num of queues, rss in the device is perfecly works and there are no rss misses, rx virqueue is the same as rss reporting queue).
    rx queues size was the default one, i.e 256.
    on guest: iperf3 -s
    on host: perf3 -c guest-ip -l 10 -N -P4, this gives 10 seconds test transferring approx the same amount of data
    I've built qemu to be able to see the number of lost packets after each session
    in such setup I expected that the behaviour with 25% will be better than with 0%, but... this is not what I see.
    my results show that both 25% and 100% have higher packet loss than 0%. Surprisingly.
    (numbers: 100% and 25% - average 700 packets lost per session of iperf, 0% - 120)
    Even if I use rx_queue_size=1024 in qemu (for consistency need to change it also in the driver), I do not see that 25% brings some improvement (0% and 25% - the same average of ~30 losses per session).
    In general I do not agree with some parts of the code, but the principle of conditional NDIS_RECEIVE_FLAGS_RESOURCES is correct, IMO.
    So, probably we need to continue looking on it.

@80886
Copy link
Author

80886 commented Dec 16, 2023

@ybendito

I'll provide some information that I can recall for now, and some details will need to be confirmed next week.

  1. The qemu we used is an old stable version(around 2018, CentOS) , but the exact version will be confirmed on Monday.
  2. In the two cases I investigated, one has 16 cores and 32G of memory, 8 queues, with a queue size of 1024, while the other has 8 cores and 16G of memory, 4 queues, with a queue size of 1024.
  3. a) The exact qemu version will be confirmed next week; b) vhost=on , in the two cases, one is vhost-kernel, and the other is vhost-user. In our production environment, we have never used vhost=off.
  4. RSC disabled.
  5. We cloned and compiled virtio-win in 2020; the exact version will be confirmed next week.

@ybendito
Copy link
Collaborator

ybendito commented Dec 18, 2023

@80886 I do not suggest to use the vhost=off in the production environment. I think that if we're looking for the solution in the driver, the solution should be the same regardless how exactly the problem is triggered. I'm looking for suitable repro scenario also with vhost, yet I'm able to reproduce it with iperf3, I see the losses (~200) in the first session after boot and I'm not sure that all the losses are reflected in TX.dropped.
IMO, the commit has 2 parts - one is in general correct, one is wrong (the trick with DpcContext). You assume here that the queue that really triggered the rescheduling of the packet is the one that should be checked for RX level. This is not correct. Other queues may also push to this RSS queue but they did not trigger the DPC as the DPC is already triggered.
What I recommend to do: as you anyway reconfigure the adapter (RSC off, 1024 queues) - reduce the number of max RSS queues to the number of physical queues (and number of CPUs should be >= num queues), then with TCP you should get the correct delivery to proper virtqueues without rescheduling. Then the part of the commit containing DpcContext can be removed. You can check that your number of reschedules is negligible by "sudo netkvm-wmi.cmd rss" (in the repo) - the number of rss misses should be very-very low and rss hits - high.
As you cloned and built your own the driver - are you doing your own HCK?

@80886
Copy link
Author

80886 commented Dec 19, 2023

@ybendito

Thanks very much for your guidance.

Regarding the iperf3 700(25%)/120(0%) packet loss:

I can also reproduce it with solution 1 commit. The packet loss in our production environment first occurred about two years ago, and at that time, our temporary emergency fix was indeed also to use NDIS_RECEIVE_FLAGS_RESOURCES, however, compared to the solution 1 commit, ther is a key difference is that instead of using DPCContext, we call pBufferDescriptor->Queue->IsRxBufferShortage() directly within ProcessReceiveQueue().

I applied the same temporary fix as we did back then, commit here. With this commit, running the iperf3 -N test again in the same environment, the packet loss decreased from 1,713,780 to 6,109 compared to the previous solution 1 commit.

test environment:

vhostuser, 2 queues + 256 queue_size , 4 CPUs, 4 rss queues;

test script:

run.\iperf3.exe -c $guest_ip -P 4 -N -l 10 100 times.

@echo off
setlocal
set /a "count=0"

:loop
if %count%==100 goto end
echo %count%
.\iperf3.exe -c `$guest_ip` -P 4 -N -l 10
set /a "count+=1"
goto loop

:end
endlocal

In my test environment, if I run iperf only once, regardless of which commit is used or what value the parameters are set to, packet loss may or may not occur, which is not convenient for comparison. If I set a single session to run for a long time using the -t parameter, such as for 3 minutes, packet loss mostly occurs only in the first few seconds, with no losses afterward. This is the reason for the script to loop 100 times, with each iteration lasting 10 seconds.

test result:

solution 1 commit with MinRxBufferPercent set to 25%:

[root@kvm ~]# virsh domifstat 88352212-4ef9-4235-b351-334a5b9e81a2 veth_271982a2
veth_271982a2 rx_bytes 1906148869
veth_271982a2 rx_packets 19802095
veth_271982a2 rx_errs 0
veth_271982a2 rx_drop 1713780         <<<============
veth_271982a2 tx_bytes 388468430
veth_271982a2 tx_packets 7117675
veth_271982a2 tx_errs 0
veth_271982a2 tx_drop 0

new commit with MinRxBufferPercent set to 25%:

[root@kvm ~]# virsh domifstat 08730dcd-42e5-4040-b58d-cca2b87b1484 veth_54099Ddd
veth_54099Ddd rx_bytes 2246687914
veth_54099Ddd rx_packets 26851892
veth_54099Ddd rx_errs 0
veth_54099Ddd rx_drop 6109          <<<============
veth_54099Ddd tx_bytes 687744795
veth_54099Ddd tx_packets 12693977
veth_54099Ddd tx_errs 0
veth_54099Ddd tx_drop 0

new commit, set to 0%:

[root@kvm ~]# virsh domifstat 08730dcd-42e5-4040-b58d-cca2b87b1484 veth_54099Ddd
veth_54099Ddd rx_bytes 1340918784
veth_54099Ddd rx_packets 9009548
veth_54099Ddd rx_errs 0
veth_54099Ddd rx_drop 1385165   <<<========
veth_54099Ddd tx_bytes 227663224
veth_54099Ddd tx_packets 4130477
veth_54099Ddd tx_errs 0
veth_54099Ddd tx_drop 0

new commit set to 0%, set afd registry param DoNotHoldNicBuffers to 1:

[root@kvm ~]# virsh domifstat 88352212-4ef9-4235-b351-334a5b9e81a2 veth_271982a2
veth_271982a2 rx_bytes 2043782245
veth_271982a2 rx_packets 22922345
veth_271982a2 rx_errs 0
veth_271982a2 rx_drop 449324    <<<========
veth_271982a2 tx_bytes 441322380
veth_271982a2 tx_packets 8094288
veth_271982a2 tx_errs 0
veth_271982a2 tx_drop 0
 
(I conducted this test four times; on one occasion there was zero packet loss, while the other three instances each had around 500,000 losses, it appears that set the registry param to 1 has some effect, but not as good as expected.)

Data discrepancies can occur from test to test, likely influenced by the behavior of the protocol stack or TCP congestion control. However, the general trend is stable: in the tests mentioned above, the new commit with the parameter set at 25% invariably shows improved performance.

The new commit probably still needs modification, because it calls isRxBufferShortage() for every pBufferDescriptor in ProcessReceiveQueue(), although the overhead should not be too significant in theory.

Regarding suggestions for RSS hits/misses:

I will test it in the next couple of days.

Regarding HCK:
No.
We do not often modify or do many development on forked virtio-win. We only make changes when need to add some logs for debugging our backend virtio devices, or to temporarily fix some special issues in production environment(like this packet loss).

@ybendito
Copy link
Collaborator

@80886 This one (I mean this) looks ok to me and with it you can stay even without changing MaxRSSQueues. I've added 2 small comments to it. IMO, you can convert it to PR.

@ybendito
Copy link
Collaborator

@80886 Do you have some news that you want to share?

@80886
Copy link
Author

80886 commented Jan 5, 2024

@ybendito

I added some code to stat rxMinFreeBuffers, rxIndicatesWithResourcesFlag, and totalRxIndicates(commit link), and then conducted some tests(typical tcp case).

Test env:

16 cpus(Intel 8255C), 8 queues, 16 rss, max bandwidth 10Gbit/s

Command line:

(typical TCP case)
Server: .\ntttcp.exe -r -m 32,*,172.19.32.19 -sb 65536 -rb 65536 -t 120
Client: .\ntttcp.exe -s -m 32,*,172.19.32.19 -sb 65536 -rb 65536 -t 120

run 3 times and take the average value:

PktLoss rxMinFreeBuffers rxIndicatesWithResourcesFlag totalRxIndicates Avg CPU% BW (MBytes/sec)
256 - 0% 127 0 0 49931807 18.920 1118.500
256 - 25% 71 0 4199 (~0%) 49933289 19.175 1117.419
256 - 50% 62 0 891074 (1.7%) 49669032 21.700 1114.874
256 - 75% 0 0 3836455 (7.7%) 49679704 24.640 1113.305
256 - 100% 0 0 49228454 (100%) 49228454 28.155 1103.428
 
1024 - 0% 0 395 0 49850653 19.035 1118.551
1024 - 25% 0 478 1679 (~0%) 49899849 20.090 1118.546
1024 - 50% 0 491 2963 (~0%) 49904503 20.173 1117.954
1024 - 75% 0 578 31955 (0.06%) 49825004 22.594 1116.413
1024 - 100% 0 712 49231969 (100%) 49231969 28.059 1102.062

From the table above:

  1. with 256 queue size, even if we set MinRxBufferPercent to 75% or 100%, rxMinFreeBuffers still drops to 0 (may be due to the burst traffic).I believe packet loss could still occur if we do the test more times.
  2. with 1024 queue size, in typical tcp case, even without setting MinRxBufferPercent, rxMinFreeBuffers remains sufficient. (But in TCP_NODELAY case, increase the queue_size is not enough, when we first encountered packet loss due to tcp_nodelay/AfdHoldNicBuffers, our initial attempt was to increase the queue_size/queue_number, which had some effect, but there was still packet loss.)
  3. when MinRxBufferPercent is set to 0% and 100% respectively, CPU usage increases by about 10%, and bandwidth decreases slightly (from 1118MB/s to 1100MB/s).

Therefore, I think that for a 10Gb typical TCP case, a reasonable configuration of queue_size is better, and should not rely on MinRxBufferPercent too much. MinRxBufferPercent is mainly used for tcp_nodelay/AfdHoldNicBuffers scenarios, and a default value of 50% appears to be acceptable.

@ybendito
Copy link
Collaborator

ybendito commented Jan 6, 2024

@80886 Thank you for sharing the results.
Q1) Is 'PktLoss' comes from TCP retransmission?
Q2) are sender and receiver on the same physical machines or on different ones (I guess that on different ones and the link is 10Fb)?
Q4) RSC is disabled?
From the command line it looks like you're far enough from your desired use case, aren't you?
Correct me if I'm wrong.
I'd expect something with TCP_NODELAY (ntttcp -ndl) at least on sender size and something like -rb 0 on receiver size.
Default packet length is 64K is what you planned?
Another thing I would suggest (mainly for production setup, but probably also for test setup if it should be similar to production one) - if you select 16 rss queues and have 16 cpus - use also 16 virtio queues, then the redundant operations (like rail changing from 8 queues in HW to 16 queues in OS) should be minimized. The same if you use less cpus. (16 for VM is typical?)
May be (just a thought) we can use 0, 25, 50, 75, 100, 125 as possible values for MinRxBufferPercent when 125 means 'Auto' which will set it according to queue size - higher value for short queue?

@80886
Copy link
Author

80886 commented Jan 6, 2024

@ybendito

  1. PktLoss is from virsh domifstat ;
  2. On diffrent physical host with 25Gb link.(The bandwidth between VMs is limited to 10Gb by DPDK QOS policy)
  3. RSC disabled.

The reason I didn't use ntttcp -ndl is that in this test, my purpose is to observe the impact on CPU usage and bandwidth in a typical TCP case (TCP aggregation) , when set to 25%, 50%, 75%, and 100% respectively. In scenarios with TCP_NODELAY, although there are a large number of packets, bue the packet size is smaller and bandwidth usage is lower, so I am more concerned with the mitigation effects on packet loss. (previous iperf -N test).

As for the -rb 64k -sb 64k, because Windows uses dynamic socket buffer size by default, I had encountered performance(CPU/BW) uncertainties introduced by dynamic buffer adjustments on some earlier versions of Windows years ago, which often required more tests to obtain stable results (In newer versions of Windows, dynamic buffer tuning may not have this issue, but I hadn't test it), so when doing TCP performance testing, I habitually set a fixed socket buffer size in order to reduce uncontrollable variables, making it easier to get relatively consistent results with fewer tests.

I agree with your suggestions regarding the CPU/RSS/queues configuration.

As for 125%(auto), based on the iperf -N and ntttcp tests, I personally think that 50% for 256 queue size and 25% for 1024 queue size looks good. Although, according to the test results, 75% for 256 queue size and 50% for 1024 queue size is also acceptable, but this is a default value which will used by all users, I think it is more appropriate to set a smaller value, because the tests above were conducted in a 10Gb environment, If the bandwidth is higher , such as 20Gb, 40Gb, higher default value might introduce more memory copying inside the protocol stack, leading to more CPU usage or impact on bandwidth. If users encounter some extreme packet loss situations, they can configure the param themselves. I plan to add a document to explain the MinRxBufferPercent param, so that users could understand how to adjust this parameter according to their actual circumstances.

@ybendito
Copy link
Collaborator

ybendito commented Jan 6, 2024

@80886 Thinking again about the default settings: I see that if we take as KPI the CPU/Throughput ratio out-of-the-box (and we probably do not have too much choice here) we need to stay with 0% as a default. When the responsiveness is important - 25% can be recommended.
BTW (probably you know that) the with netkvmco.exe the configuration of any existing parameter can be done using command-line.
Please confirm that in the iperf test -N the "new commit" means this one. It looks that it better then "solution 1" mentioned above (the last one seems tending to use RESOURCES flag also when not so needed). Can we move to PR?
We're going to make some order it the MOF in nearest separate PR, split TX and RX things to separate structs etc.
So I suggest to not change the WMI structure right now (just update the statistics),

@80886
Copy link
Author

80886 commented Jan 6, 2024

@ybendito

"we need to stay with 0% as a default"

I agree, and this is exactly what I originally thought.

Please confirm that in the iperf test -N the "new commit" means this one.

Yes.

Can we move to PR?
We're going to make some order it the MOF in nearest separate PR, split TX and RX things to separate structs etc.
So I suggest to not change the WMI structure right now (just update the statistics)

Okay, I will revert the changes to the WMI structure and then submit a PR within next week.

By the way, I saw your comment about "unlikely" on new commit , I think that if we set 0% as default(even 25%), there is no need to swap the code blocks within the if/else, right?

@ybendito
Copy link
Collaborator

ybendito commented Jan 7, 2024

@80886

By the way, I saw your comment about "unlikely" on new commit , I think that if we set 0% as default(even 25%), there is no need to swap the code blocks within the if/else, right?
As Microsoft does not support likely/unlikely as designators of the default execution path - their rule is: what is under it - this does jump, under else - does not ))

@80886
Copy link
Author

80886 commented Jan 7, 2024

@ybendito

As Microsoft does not support likely/unlikely as designators of the default execution path - their rule is: what is under it - this does jump, under else - does not ))

I'm a bit confused, and to be honest, I haven't delved deeply into branch prediction or x86 architecture or MSVC, my understanding is limited to some basic concepts, and often, I rely on GCC's likely and unlikely when coding. I would appreciate it if I could ask a few questions.

In my commit, the C code and the resulting compiled code are as follows:(latest EWDK 22H2)

if (!isRxBufferShortage)
{
    NdisMIndicateReceiveNetBufferLists(pContext->MiniportHandle,
        indicate, 0, nIndicate,
        NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL);
}
else
{
    /* If the number of available RX buffers is insufficient, we set the
    NDIS_RECEIVE_FLAGS_RESOURCES flag so that the RX buffers can be immediately
    reclaimed once the call to NdisMIndicateReceiveNetBufferLists returns. */

    NdisMIndicateReceiveNetBufferLists(pContext->MiniportHandle,
        indicate, 0, nIndicate,
        NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL | NDIS_RECEIVE_FLAGS_RESOURCES);

    NdisInterlockedAddLargeStatistic(&pContext->extraStatistics.rxIndicatesWithResourcesFlag, nIndicate);
    ParaNdis_ReuseRxNBLs(indicate);
    pContext->m_RxStateMachine.UnregisterOutstandingItems(nIndicate);
}
.text:0000000140005742 mov     rcx, [r15+8]                    ; MiniportAdapterHandle
.text:0000000140005746 xor     r8d, r8d                        ; PortNumber
.text:0000000140005749 mov     r9d, ebx                        ; NumberOfNetBufferLists
.text:000000014000574C mov     rdx, rsi                        ; NetBufferList
.text:000000014000574F cmp     byte ptr [rbp+var_isRxBufferShortage], r8b
.text:0000000140005753 jnz     short loc_140005767
.text:0000000140005753
.text:0000000140005755 mov     [rsp+48h+ReceiveFlags], 1       ; NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL  <<------------
.text:000000014000575D call    NdisMIndicateReceiveNetBufferLists_0
.text:000000014000575D
.text:0000000140005762 jmp     loc_14000582D
.text:0000000140005762
.text:0000000140005767 ; ---------------------------------------------------------------------------
.text:0000000140005767
.text:0000000140005767 loc_140005767:                          ; CODE XREF: RxDPCWorkBody+157↑j
.text:0000000140005767 mov     [rsp+48h+ReceiveFlags], 3       ; NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL | NDIS_RECEIVE_FLAGS_RESOURCES
.text:000000014000576F call    NdisMIndicateReceiveNetBufferLists_0

After swap if/else:

if (isRxBufferShortage)
{
    /* If the number of available RX buffers is insufficient, we set the
      NDIS_RECEIVE_FLAGS_RESOURCES flag so that the RX buffers can be immediately
      reclaimed once the call to NdisMIndicateReceiveNetBufferLists returns. */

    NdisMIndicateReceiveNetBufferLists(pContext->MiniportHandle,
      indicate, 0, nIndicate,
      NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL | NDIS_RECEIVE_FLAGS_RESOURCES);

    NdisInterlockedAddLargeStatistic(&pContext->extraStatistics.rxIndicatesWithResourcesFlag, nIndicate);
    ParaNdis_ReuseRxNBLs(indicate);
    pContext->m_RxStateMachine.UnregisterOutstandingItems(nIndicate);
}
else
{
    NdisMIndicateReceiveNetBufferLists(pContext->MiniportHandle,
      indicate, 0, nIndicate,
      NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL);
}
.text:0000000140005742 mov     rcx, [r15+8]                    ; MiniportAdapterHandle
.text:0000000140005746 xor     r8d, r8d                        ; PortNumber
.text:0000000140005749 mov     r9d, ebx                        ; NumberOfNetBufferLists
.text:000000014000574C mov     rdx, rsi                        ; NetBufferList
.text:000000014000574F cmp     byte ptr [rbp+var_isRxBufferShortage], r8b
.text:0000000140005753 jz      loc_140005821
.text:0000000140005753
.text:0000000140005759 mov     [rsp+48h+ReceiveFlags], 3       ; NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL | NDIS_RECEIVE_FLAGS_RESOURCES <<----------
.text:0000000140005761 call    NdisMIndicateReceiveNetBufferLists_0
.........
.........
.........
.text:0000000140005821 loc_140005821:                          ; CODE XREF: RxDPCWorkBody+157↑j
.text:0000000140005821 mov     [rsp+48h+ReceiveFlags], 1       ; NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL   <<----------
.text:0000000140005829 call    NdisMIndicateReceiveNetBufferLists_0

Comparing the two sets of code, based on my limited understanding, I think that the code before the swap is more friendly to x86 static branch prediction. Or , are there some things I don't know? such as within the logic of static branch prediction, will the semantics of taken/not taken differ between j* and jn* instructions?

@ybendito
Copy link
Collaborator

ybendito commented Jan 7, 2024

@80886 You're right, I need to recheck the behavior, what I said was based on my checks years ago.

80886 pushed a commit to 80886/kvm-guest-drivers-windows that referenced this issue Jan 16, 2024
Implemented a workaround for rx packet loss encountered with TCP_NODELAY. By default,
this feature is turned off to minimize the impact on CPU performance and network bandwidth.

To activate the feature, go to the network card properties and set the MinRxBufferPercent
to a value above 0. A good starting point is to set MinRxBufferPercent to 25% and then monitor
for any packet loss. If packet loss continues, incrementally adjust the value to 50%, then 75%,
and if necessary, up to the maximum of 100%.

In extreme cases, it may be required to not only adjust the  MinRxBufferPercent but also to increase
the number of queues and the size of each queue (queue_size).

It is important to note that this feature is only applicable to rx packet loss.

For more details, see [this thread](virtio-win#1012).

Signed-off-by: junjiehua <[email protected]>
80886 pushed a commit to 80886/kvm-guest-drivers-windows that referenced this issue Jan 16, 2024
Implemented a workaround for rx packet loss encountered with TCP_NODELAY. By default,
this feature is turned off to minimize the impact on CPU performance and network bandwidth.

To activate the feature, go to the network card properties and set the MinRxBufferPercent
to a value above 0. A good starting point is to set MinRxBufferPercent to 25% and then monitor
for any packet loss. If packet loss continues, incrementally adjust the value to 50%, then 75%,
and if necessary, up to the maximum of 100%.

In extreme cases, it may be required to not only adjust the  MinRxBufferPercent but also to increase
the number of queues and the size of each queue (queue_size).

It is important to note that this feature is only applicable to rx packet loss.

For more details, see [this thread](virtio-win#1012).

Signed-off-by: junjiehua <[email protected]>
YanVugenfirer pushed a commit that referenced this issue Jan 18, 2024
Implemented a workaround for rx packet loss encountered with TCP_NODELAY. By default,
this feature is turned off to minimize the impact on CPU performance and network bandwidth.

To activate the feature, go to the network card properties and set the MinRxBufferPercent
to a value above 0. A good starting point is to set MinRxBufferPercent to 25% and then monitor
for any packet loss. If packet loss continues, incrementally adjust the value to 50%, then 75%,
and if necessary, up to the maximum of 100%.

In extreme cases, it may be required to not only adjust the  MinRxBufferPercent but also to increase
the number of queues and the size of each queue (queue_size).

It is important to note that this feature is only applicable to rx packet loss.

For more details, see [this thread](#1012).

Signed-off-by: junjiehua <[email protected]>
@80886 80886 closed this as completed Jan 19, 2024
heywji added a commit to heywji/tp-qemu that referenced this issue Jan 7, 2025
Under various conditions when packet rate host-to-device is high we loss
packets in QEMU due to lack of guest-allocated buffers. Look also in
virtio-win/kvm-guest-drivers-windows#1012

Signed-off-by: wji <[email protected]>
heywji added a commit to heywji/tp-qemu that referenced this issue Jan 7, 2025
Under various conditions, when the host-to-device packet rate is high, we lose
packets in QEMU due to a lack of guest-allocated buffers. Look also at
virtio-win/kvm-guest-drivers-windows#1012

Signed-off-by: wji <[email protected]>
heywji added a commit to heywji/tp-qemu that referenced this issue Jan 10, 2025
Under various conditions, when the host-to-device packet rate is high, we lose
packets in QEMU due to a lack of guest-allocated buffers. Look also at
virtio-win/kvm-guest-drivers-windows#1012

Signed-off-by: wji <[email protected]>
heywji added a commit to heywji/tp-qemu that referenced this issue Jan 10, 2025
Under various conditions, when the host-to-device packet rate is high, we lose
packets in QEMU due to a lack of guest-allocated buffers. Look also at
virtio-win/kvm-guest-drivers-windows#1012

Signed-off-by: wji <[email protected]>
heywji added a commit to heywji/tp-qemu that referenced this issue Jan 14, 2025
Under various conditions, when the host-to-device packet rate is high, we lose
packets in QEMU due to a lack of guest-allocated buffers. Look also at
virtio-win/kvm-guest-drivers-windows#1012

Signed-off-by: wji <[email protected]>
heywji added a commit to heywji/tp-qemu that referenced this issue Jan 22, 2025
Under various conditions, when the host-to-device packet rate is high, we lose
packets in QEMU due to a lack of guest-allocated buffers. Look also at
virtio-win/kvm-guest-drivers-windows#1012

Signed-off-by: wji <[email protected]>
heywji added a commit to heywji/tp-qemu that referenced this issue Jan 22, 2025
Under various conditions, when the host-to-device packet rate is high, we lose
packets in QEMU due to a lack of guest-allocated buffers. Look also at
virtio-win/kvm-guest-drivers-windows#1012

Signed-off-by: wji <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants