FRR fails to install route received for an unknown but later-created VRF #13708

Closed
2 tasks done
mgreve opened this issue Jun 6, 2023 · 4 comments

mgreve commented Jun 6, 2023


Describe the bug

  • Did you check if this is a duplicate issue?
  • Did you test it on the latest FRRouting/frr master branch?

To Reproduce

In this setup we have two hosts, host1 and host2. Each host has the same VRF device, configured in exactly the same way. We add static routes to the hosts through our FRR configuration file and exchange them via FRR (with a route server as the intermediary). We then repeat the test below on both hosts, without synchronizing them, until we see a failure:

  • Setup vrf+bridge+vxlan devices on both hosts.
  • Configure FRR appropriately:
    • With mapping from vrf to vni.
    • Adding some static routes on both hosts.
  • Reload FRR.
  • Wait for a while.
  • Check that all static routes are present on both hosts. If not, fail the test.
  • Tear down the vrf+bridge+vxlan devices, remove the static routes from the FRR config files, and the vrf to vni mapping and reload FRR.

A key point here is that the two hosts are doing these steps in a non-synchronized way. This means any host can receive a route in a VRF that it doesn't know of yet, but the VRF will be created soon thereafter. This seems to be a key condition in triggering the issue in this ticket.

At some point some or all of the static routes on host1 will fail to get installed on host2 (or vice versa). FRR will believe the routes have been installed, but at least one is not installed in the kernel. Note that:

  • No amount of waiting will fix this issue.
  • Reloading FRR will (almost always) get the route installed.

Note that the test is stopped as soon as we detect this failure. The log files will show many runs of the above test, and the last run is the failure. Also, we are not doing an FRR reload to alleviate the failure in the log file.
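
The "wait, then check" step of the procedure above can be sketched as a poll with a deadline. This is only an illustration, not the actual test harness: `wait_for_routes`, `get_installed`, and the timeout values are hypothetical names and defaults, and `get_installed` is assumed to be a caller-supplied function that returns the prefixes currently present in the kernel VRF table.

```python
import time

def wait_for_routes(expected, get_installed, timeout_s=60.0, interval_s=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll until every expected prefix is installed or the deadline passes.

    Returns the set of prefixes still missing (an empty set means success).
    `clock` and `sleep` are injectable so the loop can be tested without
    real waiting.
    """
    expected = set(expected)
    deadline = clock() + timeout_s
    while True:
        missing = expected - set(get_installed())
        if not missing or clock() >= deadline:
            return missing
        sleep(interval_s)
```

A non-empty return value corresponds to the failure described above: the deadline passed and at least one route never appeared in the kernel.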

Expected behavior

The routes should always get installed into the kernel by FRR, i.e. the FRR RIB should never fall out of sync with the kernel for an indefinite amount of time.

Versions

  • OS Version: Debian 11.7 running inside a Docker container (with the Docker host being Ubuntu 20.04.6 LTS).
  • Kernel:
Linux 5.15.0-69-generic #76~20.04.1-Ubuntu SMP Mon Mar 20 15:54:19 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • FRR Version:
FRR-8.5.1:  commit 7a2b85ae52b354248fa9da04100efba0ec6c70c9 (tag: frr-8.5.1, tag: docker/8.5.1, origin/rc/8.5) 

with two cherry-picked PRs on top of it:

lib, zebra: Fix EVPN nexthop config order #12524
zebra: re-install NHG on interface up

  • configure command used
./configure --prefix=/usr --includedir=\${prefix}/include --bindir=\${prefix}/bin --sbindir=\${prefix}/lib/frr --libdir=\${prefix}/lib/frr --libexecdir=\${prefix}/lib/frr --localstatedir=/var/run/frr --sysconfdir=/etc/frr --with-moduledir=\${prefix}/lib/frr/modules --enable-configfile-mask=0640 --enable-logfile-mask=0640 --enable-snmp=agentx --enable-multipath=64 --enable-user=frr --enable-group=frr --enable-vty-group=frrvty --with-pkg-git-version --with-pkg-extra-version=-MyOwnFRRVersion --enable-pimd --enable-watchfrr 

Additional context

Host1 - FRR VRF RIB routes:

vtysh -c 'show bgp vrf vrfv1000000 ipv4 unicast'

BGP table version is 13, local router ID is 10.250.0.1, vrf id 174
Default local pref 100, local AS 4245000001
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

    Network          Next Hop            Metric LocPrf Weight Path
 *> 0.0.0.0/0        0.0.0.0(h1-cph4.edited.com)
                                             0         32768 ?
 *                   10.250.0.2(rs1)<
                                             0             0 65000 4245000002 ?
 *> 10.5.89.1/32     169.154.1.1(h1-cph4.edited.com)
                                             0         32768 ?
 *> 10.5.89.2/32     169.154.1.1(h1-cph4.edited.com)
                                             0         32768 ?
 *> 10.5.89.3/32     10.250.0.2(rs1)<
                                             0             0 65000 4245000002 ?

Displayed  4 routes and 5 total paths

Host1 - routes installed in the VRF in the kernel (ip route show vrf ...):

ip route show vrf vrfv1000000

unreachable default proto 196 metric 20 
10.5.89.1 nhid 733 via 169.154.1.1 dev tap.506987_2 proto 196 metric 20 onlink 
10.5.89.2 nhid 741 via 169.154.1.1 dev tap.506988_2 proto 196 metric 20 onlink 

Note: The route to 10.5.89.3 is not installed in the kernel even though FRR claims it is.
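
The comparison above can also be done mechanically. The following is a minimal sketch, assuming the output of the two commands has been captured as text. The function names are hypothetical, and the parsing handles only IPv4 and the line shapes shown in this issue (for example, it does not handle a status column fused with the network, such as `*>i10.0.0.0/8`).

```python
import re

# Status column (e.g. "*>" or "*") followed by the first field on the line.
ROUTE_LINE = re.compile(r"^ ?([*>=sdhrSR]+)\s+(\S+)")
IPV4_PREFIX = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}/\d{1,2}")
# Route type keywords that `ip route` may print before the destination.
ROUTE_TYPES = {"unicast", "local", "broadcast", "multicast", "throw",
               "unreachable", "prohibit", "blackhole", "nat", "anycast"}

def bgp_best_prefixes(bgp_output):
    """Prefixes for which `show bgp vrf ... ipv4 unicast` marks a best path."""
    best, current = set(), None
    for line in bgp_output.splitlines():
        m = ROUTE_LINE.match(line)
        if not m:
            continue
        status, field = m.groups()
        if IPV4_PREFIX.fullmatch(field):
            current = field          # a new network starts on this line
        if ">" in status and current is not None:
            best.add(current)        # path lines without a network reuse the last one
    return best

def kernel_prefixes(ip_route_output):
    """Destinations listed by `ip route show vrf ...`, normalized to a.b.c.d/len."""
    found = set()
    for line in ip_route_output.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] in ROUTE_TYPES and len(tokens) > 1:
            tokens = tokens[1:]      # skip e.g. "unreachable"
        dest = tokens[0]
        if dest == "default":
            dest = "0.0.0.0/0"
        elif "/" not in dest:
            dest += "/32"            # host routes are printed without a length
        found.add(dest)
    return found

def missing_in_kernel(bgp_output, ip_route_output):
    """Best-path prefixes in BGP that have no entry in the kernel table."""
    return bgp_best_prefixes(bgp_output) - kernel_prefixes(ip_route_output)
```

Fed with the host1 output shown above, `missing_in_kernel` would be expected to return only `10.5.89.3/32`. Note that an `unreachable default` entry is counted as present, which matches how the default route appears on both hosts here.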

Host2 - FRR VRF RIB routes:

vtysh -c 'show bgp vrf vrfv1000000 ipv4 unicast'

BGP table version is 4, local router ID is 10.250.0.2, vrf id 134
Default local pref 100, local AS 4245000002
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

    Network          Next Hop            Metric LocPrf Weight Path
 *  0.0.0.0/0        10.250.0.1(rs1)<
                                             0             0 65000 4245000001 ?
 *>                  0.0.0.0(h2-cph4.edited.com)
                                             0         32768 ?
 *> 10.5.89.1/32     10.250.0.1(rs1)<
                                             0             0 65000 4245000001 ?
 *> 10.5.89.2/32     10.250.0.1(rs1)<
                                             0             0 65000 4245000001 ?
 *> 10.5.89.3/32     169.154.1.1(h2-cph4.edited.com)
                                             0         32768 ?

Displayed  4 routes and 5 total paths

Host2 - routes installed in the VRF in the kernel (ip route show vrf ...):

ip route show vrf vrfv1000000

unreachable default proto 196 metric 20 
10.5.89.1 nhid 512 via 10.250.0.1 dev brv1000000 proto bgp metric 20 onlink 
10.5.89.2 nhid 512 via 10.250.0.1 dev brv1000000 proto bgp metric 20 onlink 
10.5.89.3 nhid 504 via 169.154.1.1 dev tap.506989_2 proto 196 metric 20 onlink 

On host2 the routes are installed correctly in this run, but in other runs we have seen routes fail to install on host2 as well.

Attached files

  • FRR config from host1 and host2.
  • FRR log file from host1 and host2.
  • Daemons file from host1 and host2 (identical).

host1.tar.gz
host2.tar.gz

They are compressed as they are fairly large (~10-20MB).

Partial log analysis

We believe the key part of the logs related to the issue is the below part in host1's FRR log file.

As far as we can tell, this part means that the route 10.5.89.3 is in an unknown VRF, i.e. the VRF device is not yet present on host1. This means Zebra will not (and cannot) install the route.

2023/06/06 11:33:03 BGP: [GKC5Y-XBAX9] vrf Unknown: import evpn prefix [5]:[0]:[32]:[10.5.89.3] parent 0x564406c82e50 flags 0x410
2023/06/06 11:33:03 BGP: [KZNVF-SX7KT] ... new pi dest 0x56440741b5c0 (l 2) pi 0x56440743d0c0 (l 1, f 0x4010)
2023/06/06 11:33:03 BGP: [GQF43-N30BN] bgp_install_info_to_zebra: No zebra instance to talk to, not installing information
2023/06/06 11:33:03 BGP: [K423X-ETGCQ] group_announce_route_walkcb: afi=l2vpn, safi=evpn, p=[5]:[0]:[32]:[10.5.89.3]
2023/06/06 11:33:03 BGP: [T5JFA-13199] subgroup_process_announce_selected: p=[5]:[0]:[32]:[10.5.89.3], selected=0x564406c82e50
2023/06/06 11:33:03 BGP: [JND8M-1N1QN] subgroup_announce_check: community filter check fail for [5]:[0]:[32]:[10.5.89.3]

3 seconds later, the VRF device is now present, and the route is processed again. However, Zebra does absolutely nothing with the route.

2023/06/06 11:33:06 BGP: [GKC5Y-XBAX9] vrf vrfv1000000: import evpn prefix [5]:[0]:[32]:[10.5.89.3] parent 0x564406c82e50 flags 0x498

The last line in the log is several hours later, and the route has still not been installed:

2023/06/06 15:14:55 ZEBRA: [RTA3T-W4WDC] rtadv_event(default) with event: 4 and val: 0
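
To find other occurrences of this pattern in the large log files, one could scan for prefixes that were first imported while the VRF was `Unknown` and later re-imported into a named VRF. The sketch below assumes the log line format shown above. The function name is hypothetical, and a match is only a candidate to check against the kernel table, not proof that the route is missing.

```python
import re

# Matches BGP debug lines such as:
#   <date> <time> BGP: [ID] vrf <name>: import evpn prefix <prefix> ...
IMPORT = re.compile(
    r"^(?P<ts>\S+ \S+) BGP: \[[^\]]+\] vrf (?P<vrf>\S+): "
    r"import evpn prefix (?P<prefix>\S+)"
)

def late_vrf_imports(log_lines):
    """Return (prefix, unknown_ts, vrf, known_ts) tuples.

    Each tuple describes a prefix that was imported while its VRF was
    logged as "Unknown" and was later imported again into a named VRF.
    """
    pending = {}   # prefix -> timestamp of the first import with VRF Unknown
    results = []
    for line in log_lines:
        m = IMPORT.match(line)
        if not m:
            continue
        prefix, vrf, ts = m["prefix"], m["vrf"], m["ts"]
        if vrf == "Unknown":
            pending.setdefault(prefix, ts)
        elif prefix in pending:
            results.append((prefix, pending.pop(prefix), vrf, ts))
    return results
```

Applied to the two import lines quoted above, this would report the EVPN prefix `[5]:[0]:[32]:[10.5.89.3]` with timestamps 11:33:03 (VRF unknown) and 11:33:06 (VRF vrfv1000000).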

Netdevices

Probably not relevant but it is here for completeness.

Host2:

ip -d link show vrf vrfv1000000; ip -d link show dev vrfv1000000
133: tap.506989_2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master vrfv1000000 state UP mode DEFAULT group default qlen 10000
    link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65521 
    tun type tap pi off vnet_hdr on persist on user edited506989 
    vrf_slave table 1000000 addrgenmode none numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
135: brv1000000: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vrfv1000000 state UP mode DEFAULT group default qlen 1000
    link/ether 26:77:78:53:06:04 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 
    bridge forward_delay 1500 hello_time 200 max_age 2000 ageing_time 30000 stp_state 0 priority 32768 vlan_filtering 0 vlan_protocol 802.1Q bridge_id 8000.26:77:78:53:6:4 designated_root 8000.26:77:78:53:6:4 root_port 0 root_path_cost 0 topology_change 0 topology_change_detected 0 hello_timer    0.00 tcn_timer    0.00 topology_change_timer    0.00 gc_timer   43.85 vlan_default_pvid 1 vlan_stats_enabled 0 vlan_stats_per_port 0 group_fwd_mask 0 group_address 01:80:c2:00:00:00 mcast_snooping 1 mcast_router 1 mcast_query_use_ifaddr 0 mcast_querier 0 mcast_hash_elasticity 16 mcast_hash_max 4096 mcast_last_member_count 2 mcast_startup_query_count 2 mcast_last_member_interval 100 mcast_membership_interval 26000 mcast_querier_interval 25500 mcast_query_interval 12500 mcast_query_response_interval 1000 mcast_startup_query_interval 3124 mcast_stats_enabled 0 mcast_igmp_version 2 mcast_mld_version 1 nf_call_iptables 0 nf_call_ip6tables 0 nf_call_arptables 0 
    vrf_slave table 1000000 addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
134: vrfv1000000: <NOARP,MASTER,UP,LOWER_UP> mtu 65575 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether da:8c:4d:62:6b:9d brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 1280 maxmtu 65575 
    vrf table 1000000 addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 

Host1:

ip -d link show vrf vrfv1000000; ip -d link show dev vrfv1000000
173: tap.506987_2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master vrfv1000000 state UP mode DEFAULT group default qlen 10000
    link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65521 
    tun type tap pi off vnet_hdr on persist on user edited506987 
    vrf_slave table 1000000 addrgenmode none numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
175: brv1000000: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vrfv1000000 state UP mode DEFAULT group default qlen 1000
    link/ether 82:f8:64:91:6b:75 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 
    bridge forward_delay 1500 hello_time 200 max_age 2000 ageing_time 30000 stp_state 0 priority 32768 vlan_filtering 0 vlan_protocol 802.1Q bridge_id 8000.82:f8:64:91:6b:75 designated_root 8000.82:f8:64:91:6b:75 root_port 0 root_path_cost 0 topology_change 0 topology_change_detected 0 hello_timer    0.00 tcn_timer    0.00 topology_change_timer    0.00 gc_timer  192.92 vlan_default_pvid 1 vlan_stats_enabled 0 vlan_stats_per_port 0 group_fwd_mask 0 group_address 01:80:c2:00:00:00 mcast_snooping 1 mcast_router 1 mcast_query_use_ifaddr 0 mcast_querier 0 mcast_hash_elasticity 16 mcast_hash_max 4096 mcast_last_member_count 2 mcast_startup_query_count 2 mcast_last_member_interval 100 mcast_membership_interval 26000 mcast_querier_interval 25500 mcast_query_interval 12500 mcast_query_response_interval 1000 mcast_startup_query_interval 3124 mcast_stats_enabled 0 mcast_igmp_version 2 mcast_mld_version 1 nf_call_iptables 0 nf_call_ip6tables 0 nf_call_arptables 0 
    vrf_slave table 1000000 addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
178: tap.506988_2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master vrfv1000000 state UP mode DEFAULT group default qlen 10000
    link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65521 
    tun type tap pi off vnet_hdr on persist on user edited506988 
    vrf_slave table 1000000 addrgenmode none numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
174: vrfv1000000: <NOARP,MASTER,UP,LOWER_UP> mtu 65575 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 8e:88:32:2e:6d:6a brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 1280 maxmtu 65575 
    vrf table 1000000 addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
@mgreve mgreve added the triage Needs further investigation label Jun 6, 2023

github-actions bot commented Dec 4, 2023

This issue is stale because it has been open 180 days with no activity. Comment or remove the autoclose label in order to avoid having this issue closed.

frrbot bot commented Dec 4, 2023

This issue will be automatically closed in the specified period unless there is further activity.

@frrbot frrbot bot closed this as completed Dec 11, 2023
@frrbot frrbot bot removed the autoclose label Dec 11, 2023

mgreve commented Dec 15, 2023

This issue still persists.

@rzalamena rzalamena reopened this Dec 15, 2023
@rzalamena rzalamena added bgp zebra and removed triage Needs further investigation labels Dec 15, 2023
piotrsuchy added a commit to piotrsuchy/frr that referenced this issue May 22, 2024
Fix for a bug, where FRR fails to install route received for an unknown but later-created VRF - detailed description FRRouting#13708
piotrsuchy added a commit to piotrsuchy/frr that referenced this issue May 22, 2024
Fix for a bug, where FRR fails to install route received for an unknown but later-created VRF - detailed description can be found here FRRouting#13708

Signed-off-by: Piotr Suchy <[email protected]>
piotrsuchy added five more commits to piotrsuchy/frr that referenced this issue, with the same message and sign-off as above: three on Jun 23, 2024 and two on Jun 24, 2024.
mergify bot pushed a commit that referenced this issue Jun 28, 2024
Fix for a bug, where FRR fails to install route received for an unknown but later-created VRF - detailed description can be found here #13708

Signed-off-by: Piotr Suchy <[email protected]>
(cherry picked from commit 8044d73)
@ton31337 ton31337 self-assigned this Jul 3, 2024

ton31337 commented Jul 4, 2024

It should be fixed already by #16306.

@ton31337 ton31337 closed this as completed Jul 4, 2024