-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bgpd: fixed failing to remove VRF if there is a stale l3vni #16059
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Squash into a single commit both changes;
- Please fix frrbot/styling issues;
- If this is an issue, then we might have issues not only for L3VNIs, but for ES, L2VNIs? To me it looks like an incomplete approach yet.
e0662ce
to
d341654
Compare
Problem statement: ================== When a vrf is deleted from the kernel, before its removed from the FRR config, zebra gets to delete the the vrf and assiciated state. It does so by sending a request to delete the l3 vni associated with the vrf followed by a request to delete the vrf itself. 2023/10/06 06:22:18 ZEBRA: [JAESH-BABB8] Send L3_VNI_DEL 1001 VRF testVRF1001 to bgp 2023/10/06 06:22:18 ZEBRA: [XC3P3-1DG4D] MESSAGE: ZEBRA_VRF_DELETE testVRF1001 The zebra client communication is asynchronous and about 1/5 cases the bgp client process them in a different order. 2023/10/06 06:22:18 BGP: [VP18N-HB5R6] VRF testVRF1001(766) is to be deleted. 2023/10/06 06:22:18 BGP: [RH4KQ-X3CYT] VRF testVRF1001(766) is to be disabled. 2023/10/06 06:22:18 BGP: [X8ZE0-9TS5H] VRF disable testVRF1001 id 766 2023/10/06 06:22:18 BGP: [X67AQ-923PR] Deregistering VRF 766 2023/10/06 06:22:18 BGP: [K52W0-YZ4T8] VRF Deletion: testVRF1001(4294967295) .. and a bit later : 2023/10/06 06:22:18 BGP: [MRXGD-9MHNX] DJERNAES: process L3VNI 1001 DEL 2023/10/06 06:22:18 BGP: [NCEPE-BKB1G][EC 33554467] Cannot process L3VNI 1001 Del - Could not find BGP instance When the bgp vrf config is removed later it fails on the sanity check if l3vni is removed. if (bgp->l3vni) { vty_out(vty, "%% Please unconfigure l3vni %u\n", bgp->l3vni); return CMD_WARNING_CONFIG_FAILED; } Solution: ========= The solution is to make bgp cleanup the l3vni a bgp instance is going down. The fix: ======== The fix is to add a function in bgp_evpn.c to be responsible for for deleting the local vni, if it should be needed, and call the function from bgp_instance_down(). Testing: ======== Created a test, which can run in container lab that remove the vrf on the host before removing the vrf and the bgp config form frr. Running this test in a loop trigger the problem 18 times of 100 runs. After the fix it did not fail. To verify the fix a log message (which is not in the code any longer) were used when we had a stale l3vni and needed to call bgp_evpn_local_l3vni_del() to do the cleanup. This were hit 20 times in 100 test runs. Signed-off-by: Kacper Kwasny <[email protected]> bgpd: braces {} are not necessary for single line block Signed-off-by: Kacper Kwasny <[email protected]>
d341654
to
171d258
Compare
Squashed the commits and fixed the formatting. I apologize but unfortunately I do not have time right now to investigate your suggestion of the more holistic approach. Although I think I can pick this up later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks good ... I wonder if we should wait 'til we can check into L2 as well, or if we should push this and open an issue to check on L2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good now. One question is still regarding L2 VNIs, but nobody reported anything about that, thus I assume it's not an issue for now.
@Mergifyio backport dev/10.1 |
✅ Backports have been created
|
bgpd: fixed failing to remove VRF if there is a stale l3vni (backport #16059)
Problem statement:
When a vrf is deleted from the kernel, before its removed from the FRR config, zebra gets to delete the the vrf and assiciated state.
It does so by sending a request to delete the l3 vni associated with the vrf followed by a request to delete the vrf itself.
The zebra client communication is asynchronous and about 1/5 cases the bgp client process them in a different order.
.. and a bit later :
When the bgp vrf config is removed later it fails on the sanity check if l3vni is removed.
Solution:
The solution is to make bgp cleanup the l3vni a bgp instance is going down.
The fix:
The fix is to add a function in
bgp_evpn.c
to be responsible for for deleting the local vni, if it should be needed, and call the function frombgp_instance_down()
.Testing:
Currently we have issues with reproduction of this bug. Before we had a test, that run in container lab that removes the vrf on the host before removing the vrf and the bgp config form frr. Running this test in a loop triggered the problem 18 times of 100 runs. After the fix it did not fail.
To verify the fix a log message (which is not in the code any longer) were used when we had a stale l3vni and needed to call
bgp_evpn_local_l3vni_del()
to do the cleanup. This were hit 20 times in 100 test runs.