Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FRR] Fixing zebra to handle non notification of better admin won #17184

Merged
merged 2 commits into from
Nov 23, 2023

Conversation

dgsudharsan
Copy link
Collaborator

Why I did it

To fix #17183.
When a route is getting advertised from two paths through network command, only one of the paths is chosen. This was fixed in FRR through FRRouting/frr#14795

Work item tracking
  • Microsoft ADO (number only):

How I did it

Porting the FRR fix

How to verify it

Manually tested the scenario. Additionally verified through regression.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@liat-grozovik
Copy link
Collaborator

@prsunny @bingwang-ms FYI this is must for 202305

@liat-grozovik liat-grozovik changed the title [FRR]Fixing zebra to handle non notification of better admin won [FRR] Fixing zebra to handle non notification of better admin won Nov 19, 2023
@dgsudharsan dgsudharsan marked this pull request as ready for review November 21, 2023 15:19
@dgsudharsan dgsudharsan requested a review from lguohan as a code owner November 21, 2023 15:19
@dgsudharsan dgsudharsan marked this pull request as draft November 21, 2023 15:24
@dgsudharsan dgsudharsan marked this pull request as ready for review November 21, 2023 19:26
@StormLiangMS
Copy link
Contributor

StormLiangMS commented Nov 22, 2023

hi @dgsudharsan two questions.

  1. the issue mentioned in zebra: Fix non-notification of better admin won FRRouting/frr#14795, it says that zebra hold something which BGP doesn't know, then when install from BGP to the zebra, zebra sent back to owner to notify. But in the issue which described in [FRR]In multipath scenario when a route is advertised through network command from two neighbors, only one path is chosen. #17183., all routes are from BGP peers and send to zebra, right? Since the FRR pr could fix this issue by retrigger a route selection from BGP, is this a work around and some issue hide by this workaround?

  2. I didn't notice any failure from sonic-mgmt, I would identify this is a test gap, [FRR]In multipath scenario when a route is advertised through network command from two neighbors, only one path is chosen. #17183, could you help to file a test gap in sonic-mgmt?

@dgsudharsan
Copy link
Collaborator Author

dgsudharsan commented Nov 22, 2023

hi @dgsudharsan two questions.

  1. the issue mentioned in zebra: Fix non-notification of better admin won FRRouting/frr#14795, it says that zebra hold something which BGP doesn't know, then when install from BGP to the zebra, zebra sent back to owner to notify. But in the issue which described in [FRR]In multipath scenario when a route is advertised through network command from two neighbors, only one path is chosen. #17183., all routes are from BGP peers and send to zebra, right? Since the FRR pr could fix this issue by retrigger a route selection from BGP, is this a work around and some issue hide by this workaround?
  2. I didn't notice any failure from sonic-mgmt, I would identify this is a test gap, [FRR]In multipath scenario when a route is advertised through network command from two neighbors, only one path is chosen. #17183, could you help to file a test gap in sonic-mgmt?

@StormLiangMS Please check the comment here FRRouting/frr#14795 (comment)
Essentially the issue could be fixed in two ways 1 - Zebra performing proper notification, 2 - BGP not assuming selection of the routes. 1 is chosen since it is easier to debug and diagnose problems better.

Regarding the test gap, do you want us to file a test issue in sonic-mgmt?

@StormLiangMS
Copy link
Contributor

hi @dgsudharsan two questions.

  1. the issue mentioned in zebra: Fix non-notification of better admin won FRRouting/frr#14795, it says that zebra hold something which BGP doesn't know, then when install from BGP to the zebra, zebra sent back to owner to notify. But in the issue which described in [FRR]In multipath scenario when a route is advertised through network command from two neighbors, only one path is chosen. #17183., all routes are from BGP peers and send to zebra, right? Since the FRR pr could fix this issue by retrigger a route selection from BGP, is this a work around and some issue hide by this workaround?
  2. I didn't notice any failure from sonic-mgmt, I would identify this is a test gap, [FRR]In multipath scenario when a route is advertised through network command from two neighbors, only one path is chosen. #17183, could you help to file a test gap in sonic-mgmt?

@StormLiangMS Please check the comment here FRRouting/frr#14795 (comment) Essentially the issue could be fixed in two ways 1 - Zebra performing proper notification, 2 - BGP not assuming selection of the routes. 1 is chosen since it is easier to debug and diagnose problems better.

Regarding the test gap, do you want us to file a test issue in sonic-mgmt?

Yes, pls file a test issue in sonic-mgmt.

@StormLiangMS
Copy link
Contributor

StormLiangMS commented Nov 22, 2023

hi @dgsudharsan two questions.

  1. the issue mentioned in zebra: Fix non-notification of better admin won FRRouting/frr#14795, it says that zebra hold something which BGP doesn't know, then when install from BGP to the zebra, zebra sent back to owner to notify. But in the issue which described in [FRR]In multipath scenario when a route is advertised through network command from two neighbors, only one path is chosen. #17183., all routes are from BGP peers and send to zebra, right? Since the FRR pr could fix this issue by retrigger a route selection from BGP, is this a work around and some issue hide by this workaround?
  2. I didn't notice any failure from sonic-mgmt, I would identify this is a test gap, [FRR]In multipath scenario when a route is advertised through network command from two neighbors, only one path is chosen. #17183, could you help to file a test gap in sonic-mgmt?

@StormLiangMS Please check the comment here FRRouting/frr#14795 (comment) Essentially the issue could be fixed in two ways 1 - Zebra performing proper notification, 2 - BGP not assuming selection of the routes. 1 is chosen since it is easier to debug and diagnose problems better.
Regarding the test gap, do you want us to file a test issue in sonic-mgmt?

Yes, pls file a test issue in sonic-mgmt.

For the issue, can I understand like below.

  1. Let's name ToR as A, and one leaf as B, the other leaf is C.
  2. When A received network route from B, A will install it with Zebra, and succeed.
  3. When A received network route from C, when A install it by Zebra, but Zebra think it is not better one and drop it, but BGP doesn't know about it.

with this patch:
4. BGP will receive message ZAPI_ROUTE_BETTER_ADMIN_WON, then A will send routes with 2 neighbor nexthop links to Zebra again, then Zebra will take it and install them?

And one more question, why this issue only happened with network command route but not others?

@dgsudharsan
Copy link
Collaborator Author

  • When A received network route from B, A will install it with Zebra, and succeed.

I think this is correct. Why it happens only with network command, I am not sure. @donaldsharp Can you please share your insights?

@donaldsharp
Copy link

Without the network command BGP would just have a route from a peer that was not installed which it would never indicate to another peer with bgp suppress fib turned on. So BGP was in a broken state but the broken state was irrelevant because suppress-fib would never allow the route to be passed on to peers

@dgsudharsan
Copy link
Collaborator Author

@StormLiangMS Here is the test gap I filed sonic-net/sonic-mgmt#10837

@StormLiangMS
Copy link
Contributor

thanks for this details. @dgsudharsan @donaldsharp

Copy link
Contributor

@StormLiangMS StormLiangMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@StormLiangMS StormLiangMS merged commit 8aa6a74 into sonic-net:master Nov 23, 2023
@StormLiangMS
Copy link
Contributor

ADO: 25944682

mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Nov 23, 2023
…nic-net#17184)

* [FRR]Fixing zebra to handle non notification of better admin won

* Updating the patch with latest changes from FRR
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202305: #17279

mssonicbld pushed a commit that referenced this pull request Nov 23, 2023
…7184)

* [FRR]Fixing zebra to handle non notification of better admin won

* Updating the patch with latest changes from FRR
yxieca pushed a commit that referenced this pull request Dec 4, 2023
…7184)

* [FRR]Fixing zebra to handle non notification of better admin won

* Updating the patch with latest changes from FRR
StormLiangMS pushed a commit to sonic-net/sonic-mgmt that referenced this pull request Oct 23, 2024
…the TOR (#14683)

Description of PR
Summary:
Fixes # (issue)
#10837

Orignal Problem
In multipath scenario when a route is advertised through network command from two neighbors, only one path is chosen. Doing clear bgp * recovers the device from this scenario.

If the same route (be it default or non-default) is advertised from two leaf to a ToR, the leaf initially shows only one leaf's links as next hop. When we do clear bgp *, it is then updated.

B>* 20.0.0.1/32 [20/0] via 192.168.1.2, PortChannel1, weight 1, 00:16:52

               via 192.168.2.2, PortChannel2, weight 1, 00:16:52
               via 192.168.3.2, PortChannel3, weight 1, 00:16:52
               via 192.168.4.2, PortChannel4, weight 1, 00:16:52
After doing clear bgp * in tgn-sonic-n2-t1 I can see this

B>* 20.0.0.1/32 [20/0] via 192.168.1.2, PortChannel1, weight 1, 00:00:02

               via 192.168.2.2, PortChannel2, weight 1, 00:00:02
               via 192.168.3.2, PortChannel3, weight 1, 00:00:02
               via 192.168.4.2, PortChannel4, weight 1, 00:00:02
               via 192.168.5.2, PortChannel5, weight 1, 00:00:02
               via 192.168.6.2, PortChannel6, weight 1, 00:00:02
               via 192.168.8.2, PortChannel8, weight 1, 00:00:02

This issue was fixed in PR sonic-net/sonic-buildimage#17184
However a test for this was missing.
so this test adds a prefix on 2 T1 neighbors using network command and checks if the advertised route is learnt on the T0 DUT
sreejithsreekumaran pushed a commit to sreejithsreekumaran/sonic-mgmt that referenced this pull request Nov 15, 2024
…the TOR (sonic-net#14683)

Description of PR
Summary:
Fixes # (issue)
sonic-net#10837

Orignal Problem
In multipath scenario when a route is advertised through network command from two neighbors, only one path is chosen. Doing clear bgp * recovers the device from this scenario.

If the same route (be it default or non-default) is advertised from two leaf to a ToR, the leaf initially shows only one leaf's links as next hop. When we do clear bgp *, it is then updated.

B>* 20.0.0.1/32 [20/0] via 192.168.1.2, PortChannel1, weight 1, 00:16:52

               via 192.168.2.2, PortChannel2, weight 1, 00:16:52
               via 192.168.3.2, PortChannel3, weight 1, 00:16:52
               via 192.168.4.2, PortChannel4, weight 1, 00:16:52
After doing clear bgp * in tgn-sonic-n2-t1 I can see this

B>* 20.0.0.1/32 [20/0] via 192.168.1.2, PortChannel1, weight 1, 00:00:02

               via 192.168.2.2, PortChannel2, weight 1, 00:00:02
               via 192.168.3.2, PortChannel3, weight 1, 00:00:02
               via 192.168.4.2, PortChannel4, weight 1, 00:00:02
               via 192.168.5.2, PortChannel5, weight 1, 00:00:02
               via 192.168.6.2, PortChannel6, weight 1, 00:00:02
               via 192.168.8.2, PortChannel8, weight 1, 00:00:02

This issue was fixed in PR sonic-net/sonic-buildimage#17184
However a test for this was missing.
so this test adds a prefix on 2 T1 neighbors using network command and checks if the advertised route is learnt on the T0 DUT
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants