AF_PACKET stopped receiving traffic #927

Closed
glazychev-art opened this issue Sep 12, 2023 · 8 comments · Fixed by #999

@glazychev-art
Contributor

Description

https://jira.fd.io/browse/VPP-2081
This issue is a continuation of this discussion: networkservicemesh/govpp#9 (comment)

In short, there are two main suspects:

  • calico-vpp patches
  • tap interfaces, not af_packet.
@ljkiraly
Contributor

Hi,
The latest test on SUSE Linux failed (using an image without the calico patches and with TAP interface creation disabled in NSM). So we will focus on Ubuntu-based images.
@glazychev-art, could you provide an image with the calico patches removed (TAP enabled)?

@glazychev-art
Contributor Author

Hi @ljkiraly ,
Ok, got it.

Built an image - artgl/cmd-forwarder-vpp:no_calico

@szvincze

szvincze commented Oct 7, 2023

Hi,

Below are two consecutive outputs of the vppctl show hardware-interfaces command, taken while the issue with the AF_PACKET interface was occurring (the left side is the earlier one). The next block value in RX Queue 0 is the same in both printouts, and the Pending Request: num-rx-pkts:23 next-frame-offset:302936 line is also present in both.

              Name                Idx   Link  Hardware                                                                        Name                Idx   Link  Hardware
host-eth0                          1     up   host-eth0                                                         host-eth0                          1     up   host-eth0
  Link speed: unknown                                                                                             Link speed: unknown
  RX Queues:                                                                                                      RX Queues:
    queue thread         mode                                                                                       queue thread         mode
    0     main (0)       interrupt                                                                                  0     main (0)       interrupt
  TX Queues:                                                                                                      TX Queues:
    TX Hash: [name: hash-eth-l34 priority: 50 description: Hash ethernet L34 headers]                               TX Hash: [name: hash-eth-l34 priority: 50 description: Hash ethernet L34 headers]
    queue shared thread(s)                                                                                          queue shared thread(s)
    0     no     0                                                                                                  0     no     0
  Ethernet address 54:52:00:00:8b:10                                                                              Ethernet address 54:52:00:00:8b:10
  Linux PACKET socket interface v3                                                                                Linux PACKET socket interface v3
  FEATURES:                                                                                                       FEATURES:
  RX Queue 0:                                                                                                     RX Queue 0:
    block size:327680 nr:160  frame size:10240 nr:5120 next block:148                                               block size:327680 nr:160  frame size:10240 nr:5120 next block:148
    Pending Request: num-rx-pkts:23 next-frame-offset:302936                                                        Pending Request: num-rx-pkts:23 next-frame-offset:302936
  TX Queue 0:                                                                                                     TX Queue 0:
    block size:10485760 nr:1  frame size:10240 nr:1024 next frame:493                                               block size:10485760 nr:1  frame size:10240 nr:1024 next frame:642
    available:1024 request:0 sending:0 wrong:0 total:1024                                                           available:1024 request:0 sending:0 wrong:0 total:1024

It is important to note that this happened with the artgl/cmd-forwarder-vpp:vpp_c3f505fe7b7f_no_calico_af_packet_v3 image. So the issue occurs with that image too, but apparently less frequently than with the other images we have tried so far: this is the first time we have observed it with this image.
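The comparison above can be automated. Here is a minimal sketch in Python that takes two captures of the vppctl show hardware-interfaces output and flags the pattern described here: RX Queue 0 has the same next block and the same Pending Request in both captures while the TX Queue 0 next frame has moved. The function names and the "TX moved, RX did not" heuristic are assumptions made for illustration; they are not part of VPP or NSM, and a match only suggests a stalled RX ring, it does not prove one.

```python
import re


def parse_queue_state(text):
    """Pull the ring positions out of one `show hardware-interfaces` capture."""
    rx = re.search(r"RX Queue 0:.*?next block:(\d+)", text, re.S)
    tx = re.search(r"TX Queue 0:.*?next frame:(\d+)", text, re.S)
    pending = re.search(
        r"Pending Request: num-rx-pkts:(\d+) next-frame-offset:(\d+)", text
    )
    return {
        "rx_next_block": int(rx.group(1)) if rx else None,
        "tx_next_frame": int(tx.group(1)) if tx else None,
        "pending": (int(pending.group(1)), int(pending.group(2))) if pending else None,
    }


def rx_looks_stalled(earlier, later):
    """Heuristic (assumption): RX is frozen while TX keeps advancing."""
    a, b = parse_queue_state(earlier), parse_queue_state(later)
    rx_frozen = a["rx_next_block"] is not None and a["rx_next_block"] == b["rx_next_block"]
    same_pending = a["pending"] is not None and a["pending"] == b["pending"]
    tx_moved = a["tx_next_frame"] != b["tx_next_frame"]
    return rx_frozen and same_pending and tx_moved


# The relevant lines from the two printouts in this comment.
EARLIER = """\
  RX Queue 0:
    block size:327680 nr:160  frame size:10240 nr:5120 next block:148
    Pending Request: num-rx-pkts:23 next-frame-offset:302936
  TX Queue 0:
    block size:10485760 nr:1  frame size:10240 nr:1024 next frame:493
"""
LATER = EARLIER.replace("next frame:493", "next frame:642")

print(rx_looks_stalled(EARLIER, LATER))
```

Feeding it the full command output works as well, since only the RX Queue 0 and TX Queue 0 sections are inspected.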

@glazychev-art
Contributor Author

@szvincze
Thank you for the information, it's very interesting.
The main thing that became clear is that the problem is with the AF_PACKET host interface. Before this, I assumed that the problem was between the NSC/NSE and the forwarder.

Question:
What NetworkService.Payload do you use in your setup? Is it only Ethernet (and Vxlan respectively) or also IP (IPSec/Wireguard is used)?

@szvincze

Question: What NetworkService.Payload do you use in your setup? Is it only Ethernet (and Vxlan respectively) or also IP (IPSec/Wireguard is used)?

@glazychev-art: It is only Ethernet.

@glazychev-art
Contributor Author

Details

The problem is reproduced when the forwarder is removed:
kubectl delete pod -n nsm-system <forwarder-name>

Node 1:
forwarder-vpp-xwn6h
nsmgr-mdb6l
receiver-7f87fd788d-g95j9 (NSC)

Node 2:
forwarder-vpp-jtc57
nsmgr-vdllv
tg-6887d76c64-8fvf2 (NSE)

NSM-logs: Link
Direct traces forwarder-vpp-xwn6h: Link
Direct traces forwarder-vpp-jtc57: Link
Backward traces: Link

root@node-10-63-139-16:/# vppctl show errors
   Count                  Node                              Reason               Severity 
        10             null-node                      blackholed packets           error  
     42754            vxlan4-encap                good packets encapsulated        error  
     42754       acl-plugin-out-ip4-fa                ACL permit packets           error  
     42754       acl-plugin-out-ip4-fa                 checked packets             error  
         3         l2-output-bad-intf        L2 output to interface not in L2 mo   error  
     42757             l2-output                      L2 output packets            error  
     42757              l2-input                       L2 input packets            error

tg-6887d76c64-8fvf2:/$ tcpdump -i nsc-g95j9 -v arp
tcpdump: listening on nsc-g95j9, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:14:30.542091 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 169.254.0.13 tell 169.254.0.14, length 28
12:14:30.542138 ARP, Ethernet (len 6), IPv4 (len 4), Reply 169.254.0.13 is-at 02:fe:ba:13:28:59 (oui Unknown), length 28
12:14:31.549309 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 169.254.0.13 tell 169.254.0.14, length 28
12:14:31.549322 ARP, Ethernet (len 6), IPv4 (len 4), Reply 169.254.0.13 is-at 02:fe:ba:13:28:59 (oui Unknown), length 28

root@node-10-63-139-16:/# vppctl show buffers
Pool Name            Index NUMA  Size  Data Size  Total  Avail  Cached   Used  
default-numa-0         0     0   4032     3776    32768  32250    512      6

node-10-63-139-16:~> sudo /usr/sbin/ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:             256
RX Mini:        n/a
RX Jumbo:       n/a
TX:             256
Current hardware settings:
RX:             256
RX Mini:        n/a
RX Jumbo:       n/a
TX:             256

node-10-63-139-16:~> sudo /usr/sbin/ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX:             n/a
TX:             n/a
Other:          n/a
Combined:       1
Current hardware settings:
RX:             n/a
TX:             n/a
Other:          n/a
Combined:       1

node-10-63-139-16:~> ls -l /sys/class/net/eth0/device/driver
lrwxrwxrwx 1 root root 0 okt   31 11.07 /sys/class/net/eth0/device/driver -> ../../../../../bus/virtio/drivers/virtio_net

@glazychev-art
Contributor Author

How to reproduce

  1. Deploy a kind cluster, NSM and Kernel2Ethernet2Kernel
  2. Find the IP address of the node where the NSC is located (172.18.0.4 in my case). This IP address (and its interface) is the one forwarder-vpp actually uses.
  3. Ping this interface from your host environment: sudo ping 172.18.0.4 -i 0,001 -s 65000
    The purpose of this is to heavily load the node interface.
  4. Restart the forwarder on the NSC node: kubectl delete pod -n nsm-system -l app=forwarder-vpp --field-selector spec.nodeName=<node-name>
    (this may need to be repeated several times)
  5. Ping from NSC to NSE no longer works.
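
For anyone who wants to script steps 3 and 4, here is a rough sketch in Python. It only assembles the two commands from the steps above and repeats the forwarder deletion. The helper names, the repeat count and the pause between restarts are assumptions for illustration (the steps only say "several times"), and the ping interval is kept as 0,001 exactly as written above; whether a comma or a dot is accepted may depend on the locale.

```python
import subprocess
import time


def flood_ping_cmd(node_ip, interval="0,001", size=65000):
    """Step 3: large, very frequent pings to load the node interface."""
    return ["sudo", "ping", node_ip, "-i", interval, "-s", str(size)]


def delete_forwarder_cmd(node_name):
    """Step 4: delete the forwarder-vpp pod scheduled on the given node."""
    return [
        "kubectl", "delete", "pod", "-n", "nsm-system",
        "-l", "app=forwarder-vpp",
        "--field-selector", f"spec.nodeName={node_name}",
    ]


def restart_forwarder(node_name, times=3, pause_s=30.0,
                      run=subprocess.run, sleep=time.sleep):
    """Repeat the deletion; `times` and `pause_s` are guesses, tune as needed."""
    cmd = delete_forwarder_cmd(node_name)
    for _ in range(times):
        run(cmd, check=True)  # raises if kubectl exits with an error
        sleep(pause_s)        # give the replacement pod time to start
    return cmd


# Possible usage (not executed here):
#   flood = subprocess.Popen(flood_ping_cmd("172.18.0.4"))
#   restart_forwarder("<node-name>")
#   flood.terminate()
```

After the restarts, step 5 is still checked by hand: ping from the NSC to the NSE and see whether replies come back.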

@glazychev-art
Contributor Author

VPP patch - https://gerrit.fd.io/r/c/vpp/+/39824
