Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bgpd: fixed failing to remove VRF if there is a stale l3vni #16059

Merged
merged 1 commit into from
Jun 20, 2024

Conversation

kacpekwasny
Copy link
Contributor

Problem statement:

When a vrf is deleted from the kernel, before its removed from the FRR config, zebra gets to delete the the vrf and assiciated state.
It does so by sending a request to delete the l3 vni associated with the vrf followed by a request to delete the vrf itself.

2023/10/06 06:22:18 ZEBRA: [JAESH-BABB8] Send L3_VNI_DEL 1001 VRF testVRF1001 to bgp
2023/10/06 06:22:18 ZEBRA: [XC3P3-1DG4D] MESSAGE: ZEBRA_VRF_DELETE testVRF1001

The zebra client communication is asynchronous and about 1/5 cases the bgp client process them in a different order.

2023/10/06 06:22:18 BGP: [VP18N-HB5R6] VRF testVRF1001(766) is to be deleted.
2023/10/06 06:22:18 BGP: [RH4KQ-X3CYT] VRF testVRF1001(766) is to be disabled.
2023/10/06 06:22:18 BGP: [X8ZE0-9TS5H] VRF disable testVRF1001 id 766
2023/10/06 06:22:18 BGP: [X67AQ-923PR] Deregistering VRF 766
2023/10/06 06:22:18 BGP: [K52W0-YZ4T8] VRF Deletion: testVRF1001(4294967295)

.. and a bit later :

2023/10/06 06:22:18 BGP: [MRXGD-9MHNX] DJERNAES: process L3VNI 1001 DEL
2023/10/06 06:22:18 BGP: [NCEPE-BKB1G][EC 33554467] Cannot process L3VNI 1001 Del - Could not find BGP instance

When the bgp vrf config is removed later it fails on the sanity check if l3vni is removed.

    if (bgp->l3vni) {
        vty_out(vty, "%% Please unconfigure l3vni %u\n",
            bgp->l3vni);
        return CMD_WARNING_CONFIG_FAILED;
    }

Solution:

The solution is to make bgp cleanup the l3vni a bgp instance is going down.

The fix:

The fix is to add a function in bgp_evpn.c to be responsible for for deleting the local vni, if it should be needed, and call the function from bgp_instance_down().

Testing:

Currently we have issues with reproduction of this bug. Before we had a test, that run in container lab that removes the vrf on the host before removing the vrf and the bgp config form frr. Running this test in a loop triggered the problem 18 times of 100 runs. After the fix it did not fail.

To verify the fix a log message (which is not in the code any longer) were used when we had a stale l3vni and needed to call bgp_evpn_local_l3vni_del() to do the cleanup. This were hit 20 times in 100 test runs.

Copy link
Member

@ton31337 ton31337 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Squash into a single commit both changes;
  • Please fix frrbot/styling issues;
  • If this is an issue, then we might have issues not only for L3VNIs, but for ES, L2VNIs? To me it looks like an incomplete approach yet.

@kacpekwasny kacpekwasny force-pushed the kkwasny/CLIC-139-4 branch from e0662ce to d341654 Compare May 27, 2024 09:04
Problem statement:
==================
When a vrf is deleted from the kernel, before its removed from the FRR
config, zebra gets to delete the the vrf and assiciated state.

It does so by sending a request to delete the l3 vni associated with the
vrf followed by a request to delete the vrf itself.

2023/10/06 06:22:18 ZEBRA: [JAESH-BABB8] Send L3_VNI_DEL 1001 VRF
testVRF1001 to bgp
2023/10/06 06:22:18 ZEBRA: [XC3P3-1DG4D] MESSAGE: ZEBRA_VRF_DELETE
testVRF1001

The zebra client communication is asynchronous and about 1/5 cases the
bgp client process them in a different order.

2023/10/06 06:22:18 BGP: [VP18N-HB5R6] VRF testVRF1001(766) is to be
deleted.
2023/10/06 06:22:18 BGP: [RH4KQ-X3CYT] VRF testVRF1001(766) is to be
disabled.
2023/10/06 06:22:18 BGP: [X8ZE0-9TS5H] VRF disable testVRF1001 id 766
2023/10/06 06:22:18 BGP: [X67AQ-923PR] Deregistering VRF 766
2023/10/06 06:22:18 BGP: [K52W0-YZ4T8] VRF Deletion:
testVRF1001(4294967295)
.. and a bit later :
2023/10/06 06:22:18 BGP: [MRXGD-9MHNX] DJERNAES: process L3VNI 1001 DEL
2023/10/06 06:22:18 BGP: [NCEPE-BKB1G][EC 33554467] Cannot process L3VNI
1001 Del - Could not find BGP instance

When the bgp vrf config is removed later it fails on the sanity check if
l3vni is removed.

        if (bgp->l3vni) {
            vty_out(vty, "%% Please unconfigure l3vni %u\n",
                bgp->l3vni);
            return CMD_WARNING_CONFIG_FAILED;
        }

Solution:
=========
The solution is to make bgp cleanup the l3vni a bgp instance is going
down.

The fix:
========
The fix is to add a function in bgp_evpn.c to be responsible for for
deleting the local vni, if it should be needed, and call the function
from bgp_instance_down().

Testing:
========
Created a test, which can run in container lab that remove the vrf on
the host before removing the vrf and the bgp config form frr. Running
this test in a loop trigger the problem 18 times of 100 runs. After the
fix it did not fail.

To verify the fix a log message (which is not in the code any longer)
were used when we had a stale l3vni and needed to call
bgp_evpn_local_l3vni_del() to do the cleanup. This were hit 20 times in
100 test runs.

Signed-off-by: Kacper Kwasny <[email protected]>

bgpd: braces {} are not necessary for single line block

Signed-off-by: Kacper Kwasny <[email protected]>
@kacpekwasny kacpekwasny force-pushed the kkwasny/CLIC-139-4 branch from d341654 to 171d258 Compare May 27, 2024 12:13
@kacpekwasny
Copy link
Contributor Author

Squashed the commits and fixed the formatting. I apologize but unfortunately I do not have time right now to investigate your suggestion of the more holistic approach. Although I think I can pick this up later.

Copy link
Member

@riw777 riw777 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good ... I wonder if we should wait 'til we can check into L2 as well, or if we should push this and open an issue to check on L2?

@donaldsharp donaldsharp requested a review from chiragshah6 May 28, 2024 15:18
@riw777 riw777 added the bugfix label Jun 4, 2024
@github-actions github-actions bot added the rebase PR needs rebase label Jun 4, 2024
Copy link
Member

@ton31337 ton31337 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now. One question is still regarding L2 VNIs, but nobody reported anything about that, thus I assume it's not an issue for now.

@ton31337
Copy link
Member

@Mergifyio backport dev/10.1

Copy link

mergify bot commented Jun 20, 2024

backport dev/10.1

✅ Backports have been created

@ton31337 ton31337 merged commit 34a6e22 into FRRouting:master Jun 20, 2024
8 checks passed
@kacpekwasny kacpekwasny deleted the kkwasny/CLIC-139-4 branch June 20, 2024 14:17
ton31337 added a commit that referenced this pull request Jun 21, 2024
bgpd: fixed failing to remove VRF if there is a stale l3vni (backport #16059)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants