[routesync] Fix for stale dynamic neighbor #2553
@vganesan-nokia, I think you may want to follow the CLA guidelines to sign.
/easycla
@prsunny please check on this.
/azp run Azure.sonic-swss
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Force-pushed from 69ecd73 to a923a54.
@prsunny: Rebased and force pushed. I still see vstest failures (test_warmreboot.py).
OK, it seems this test is consistently failing on warm boot. Can you please run the test manually?
Signed-off-by: vedganes <[email protected]> This commit fixes the warm restart unit test failure. When a route whose only next hop is on eth0 or docker0 is removed, and that route is the default route, orchagent sends an "add" for a black hole route to syncd, so the ASIC DB receives an hset message. When this happens during warm restart, the unit test flags it as an unwanted setting and fails. To fix this issue, the route delete is sent only if warm restart is not in progress. This follows the same warm restart handling approach used for route deletes in other places.
@prsunny: Fixed unit test failures. Please make arrangements to get this reviewed.
@prsunny, can you check the change related to warm reboot?
* [routesync] Fix for stale dynamic neighbor

The changes fix a stale neighbor in the ASIC_DB and data path when eBGP neighbors are shut down and neighbors are flushed. The problem is described in issue sonic-net/sonic-buildimage#12442.

The root cause of this issue is that the route is not deleted from the ASIC_DB when the route's next hop is on the eth0 or docker0 interface. The solution is to delete the route entry from ASIC_DB, instead of just returning, when the route's next hop is on eth0 or docker0.

This commit also fixes the warm restart unit test failure. When a route whose only next hop is on eth0 or docker0 is removed, and that route is the default route, orchagent sends an "add" for a black hole route to syncd, so the ASIC DB receives an hset message. When this happens during warm restart, the unit test flags it as an unwanted setting and fails. To fix this issue, the route delete is sent only if warm restart is not in progress. This follows the same warm restart handling approach used for route deletes in other places.

Signed-off-by: vedganes <[email protected]>
Update sonic-swss submodule pointer to include the following:
* a2a483d [acl] Add new ACL key BTH_OPCODE and AETH_SYNDROME ([sonic-net#2617](sonic-net/sonic-swss#2617))
* 9d1f66b [bfdorch] add local discriminator to state DB ([sonic-net#2629](sonic-net/sonic-swss#2629))
* c54b3d1 Vxlan tunnel endpoint custom monitoring APPL DB table. ([sonic-net#2589](sonic-net/sonic-swss#2589))
* 7f03db2 Fix potential risks ([sonic-net#2516](sonic-net/sonic-swss#2516))
* 383ee68 [refactor]Refactoring sai handle status ([sonic-net#2621](sonic-net/sonic-swss#2621))
* cd95972 Fix issue 13341 ARP entry can be out of sync between kernel and APPL_DB if multiple updates are received from RTNL ([sonic-net#2619](sonic-net/sonic-swss#2619))
* a01470f Remove TODO comments that are no longer relevant ([sonic-net#2622](sonic-net/sonic-swss#2622))
* d058390 Changed the BFD default detect multiplier to 10x ([sonic-net#2614](sonic-net/sonic-swss#2614))
* d78b528 [MuxOrch] Enabling neighbor when adding in active state ([sonic-net#2601](sonic-net/sonic-swss#2601))
* 4ebdad1 [routesync] Fix for stale dynamic neighbor ([sonic-net#2553](sonic-net/sonic-swss#2553))
* 8857f92 Added new attributes for Vnet and Vxlan ecmp configurations. ([sonic-net#2584](sonic-net/sonic-swss#2584))
* b6bbc3e Revert [voq][chassis]Add show fabric counters port/queue commands (2522) ([sonic-net#2611](sonic-net/sonic-swss#2611))
* 52406e2 Add missing parameter to on_switch_shutdown_request method. ([sonic-net#2567](sonic-net/sonic-swss#2567))
* 4ac9ad9 Increase diff coverage to 80% ([sonic-net#2599](sonic-net/sonic-swss#2599))
* 8a0bb36 Handle Mac address 'none' ([sonic-net#2593](sonic-net/sonic-swss#2593))
* f496ab3 [vstest] Only collect stdout of orchagent_restart_check in vstest ([sonic-net#2597](sonic-net/sonic-swss#2597))
* 1dab495 Avoid aborting orchagent when setting TUNNEL attributes ([sonic-net#2591](sonic-net/sonic-swss#2591))
* 4395cea Fix neighbor doesn't update all attribute ([sonic-net#2577](sonic-net/sonic-swss#2577))
Signed-off-by: dprital <[email protected]>
Merge remote-tracking branch 'upstream/master' into dash (#2663)
* Modify coppmgr mergeConfig to support preserving copp tables through reboot. (#2548)
* Avoid aborting orchagent when setting TUNNEL attributes (#2591)
* Handle Mac address 'none' (#2593)
* Increase diff coverage to 80% (#2599)
* Add missing parameter to on_switch_shutdown_request method. (#2567)
* Add ZMQ based ProducerStateTable and CustomerStateTable.
* Revert "[voq][chassis]Add show fabric counters port/queue commands (#2522)" (#2611)
* Added new attributes for Vnet and Vxlan ecmp configurations. (#2584)
* added support for monitoring, primary and adv_prefix and overlay_dmac.
* [routesync] Fix for stale dynamic neighbor (#2553)
* [MuxOrch] Enabling neighbor when adding in active state (#2601)
* Changed the BFD default detect multiplier to 10x (#2614)
* Remove TODO comments that are no longer relevant (#2622)
* Fix issue #13341 ARP entry can be out of sync between kernel and APPL_DB if multiple updates are received from RTNL (#2619)
* [refactor]Refactoring sai handle status (#2621)
* Vxlan tunnel endpoint custom monitoring APPL DB table. (#2589)
* added support for monitoring, primary and adv_prefix. changed filter_mac to overlay_dmac
* Data Structures and code to write APP_DB VNET_MONITOR table entries for custom monitoring of Vxlan tunnel endpoints.
* [bfdorch] add local discriminator to state DB (#2629)
* [acl] Add new ACL key BTH_OPCODE and AETH_SYNDROME (#2617)
* [voq][chassis] Remove created ports from the default vlan. (#2607)
* [EVPN]Handling race condition when remote VNI arrives before tunnel map entry (#2642)
* Added check in remote VNI add to ensure vxlan tunnel map is created before adding the remote end point.
* [test_mux] add sleep in test_NH (#2648)
* [autoneg]Fixing adv interface types to be set when AN is disabled (#2638)
* [hash]: Add UT infra. (#2660)
* Added UT infra for Generic Hash feature
* Aligned PBH tests with Generic Hash UT infra
* [sai_failure_dump]Invoking dump during SAI failure (#2644)
* [ResponsePublisher] add pipeline support (#2511)
* [dash] Fix compilation issue caused by missing include.
@vganesan-nokia, I noticed that this PR is not included in the 202211 branch, so I added a label for it. @StormLiangMS, can you please cherry-pick it to 202211?
This reverts commit cc768db.
// In this case since we do not want the route with next hop on eth0/docker0, we return.
// But still we need to clear the route from the APPL_DB. Otherwise the APPL_DB and data
// path will be left with stale route entry
if(alsv.size() == 1)
From the context, this branch seems intended for delete messages. Should we check it with
if((nlmsg_type == RTM_DELROUTE) && (alsv.size() == 1))
?
Yes.
@vganesan-nokia , do you want to address the comment?
Even if it is an RTM_NEWROUTE on eth0/docker0, it may run into
if (!warmRestartInProgress)
{
...
m_routeTable.del(destipprefix);
}
and then the route is deleted from APPL_DB and ASIC_DB?
This happened on my real device: the default route (0.0.0.0/0) was removed.
@vganesan-nokia , do you want to address the comment?
Yes. I'll make a PR to fix this.
Follow-up changes to the stale neighbor fix in PR #2553 (sonic-net#2553). The changes add a check so that the stale neighbor is deleted from ASIC_DB and APPL_DB only if the kernel command is RTM_DELROUTE. Signed-off-by: vedganes <[email protected]>
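The checks discussed in the review thread can be summarized in a small decision function. This is only a sketch of the conditions named in this conversation (message type, a single next hop on eth0/docker0, and warm restart state), not the routesync code itself. `MsgType`, `Action`, `isFilteredInterface`, and `decide` are hypothetical names introduced for illustration, and the warm restart branch is simplified to "do nothing".

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the netlink message types RTM_NEWROUTE / RTM_DELROUTE.
enum class MsgType { NewRoute, DelRoute };

// What this sketch decides to do with one kernel route update.
enum class Action { Forward, Ignore, DeleteFromApplDb };

// Interfaces whose next hops are filtered out, per the PR: eth0 and docker0.
bool isFilteredInterface(const std::string& ifname) {
    return ifname == "eth0" || ifname == "docker0";
}

// Decide how to handle an update, given the interfaces of its next hops.
Action decide(MsgType type,
              const std::vector<std::string>& nexthopIfs,
              bool warmRestartInProgress) {
    assert(!nexthopIfs.empty());  // a route update carries at least one next hop

    bool hasFiltered = false;
    for (const auto& ifname : nexthopIfs) {
        if (isFilteredInterface(ifname)) {
            hasFiltered = true;
            break;
        }
    }
    if (!hasFiltered) {
        return Action::Forward;  // ordinary route, passed on unchanged
    }

    // The route has a next hop on eth0/docker0. Delete the APPL_DB entry only
    // for a delete message whose single next hop is on such an interface, and
    // only when no warm restart is in progress (simplified in this sketch).
    if (type == MsgType::DelRoute && nexthopIfs.size() == 1 && !warmRestartInProgress) {
        return Action::DeleteFromApplDb;
    }
    return Action::Ignore;
}
```

With this shape, an RTM_NEWROUTE on eth0 (the case reported from the real device) maps to `Action::Ignore`, so the default route would no longer be removed by that path.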
What I did
Route entries whose next hop is on the eth0 or docker0 interface are now deleted from APPL_DB.
Why I did it
This is done to fix the issue sonic-net/sonic-buildimage#12442.
For more information about the root cause and the solution, please see the details below.
How I verified it
Tested using the following scenario:
- Shut down interfaces so that all the eBGP neighbors that advertised the default route go down.
Details if related
The root cause of this problem is that some routes learned from eBGP neighbors are not cleared from APPL_DB when the eBGP neighbors go down. The affected routes are those that include a next hop on interface "eth0". This consistently occurs for the default routes for both IPv4 and IPv6 in any ASIC instance. Since the route is not deleted from APPL_DB (though it is deleted from the kernel), the neighbor ref_count is not 0 and hence the neighbor is not deleted from ASIC_DB.
All ASIC instances include a default route with a next hop on eth0 (with a large metric to avoid colliding with default routes learned via eBGP). The APPL_DB has this route with only the eBGP next hops because routesync filters out kernel route updates that have a next hop on eth0 or docker0. Consequently, the next hop on eth0 is not included in the route entry in APPL_DB and ASIC_DB. When all the eBGP neighbor interfaces are shut down, all eBGP neighbors go down and the default route is withdrawn. The next hop on eth0 then becomes the only next hop for the default route. When the kernel sends this update, the routesync filter mentioned above prevents the route update from being sent to APPL_DB. Hence APPL_DB is left with a stale route entry holding whatever next hops it had before the eth0 next hop became the only next hop. Since these next hops are still referenced by this route, they are not cleared when neighbors are flushed.
The solution to both of these issues, namely (1) the stale route entry in APPL_DB and ASIC_DB and (2) the stale neighbor entry in ASIC_DB, is to delete the route from APPL_DB when a kernel update is received with a next hop on "eth0" as the only next hop.
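The root cause described above can be reproduced with a toy model. The following is a minimal sketch, assuming a simplified in-memory route table with per-neighbor reference counts; it ignores message types and warm restart handling. `NextHop`, `ToyRouteDb`, and the addresses used with it are hypothetical and are not part of swss.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical next-hop record: neighbor IP plus outgoing interface.
struct NextHop {
    std::string ip;
    std::string ifname;
};

// Toy stand-in for the APPL_DB route table and the neighbor reference counts.
class ToyRouteDb {
public:
    // withFix selects the behavior described as the solution above.
    explicit ToyRouteDb(bool withFix) : withFix_(withFix) {}

    // Handle one kernel route update for the given prefix.
    void onKernelUpdate(const std::string& prefix, const std::vector<NextHop>& nhs) {
        for (const auto& nh : nhs) {
            if (nh.ifname == "eth0" || nh.ifname == "docker0") {
                // The filter drops the whole update. With the fix, an update
                // whose only next hop is on eth0/docker0 also removes the entry.
                if (withFix_ && nhs.size() == 1) {
                    remove(prefix);
                }
                return;
            }
        }
        remove(prefix);  // replace any previous entry for this prefix
        std::vector<std::string> ips;
        for (const auto& nh : nhs) {
            ++refCount_[nh.ip];
            ips.push_back(nh.ip);
        }
        routes_[prefix] = ips;
    }

    bool hasRoute(const std::string& prefix) const {
        return routes_.count(prefix) != 0;
    }

    // Number of routes still referencing this neighbor; it can be flushed at 0.
    int refCount(const std::string& ip) const {
        auto it = refCount_.find(ip);
        return it == refCount_.end() ? 0 : it->second;
    }

private:
    void remove(const std::string& prefix) {
        auto it = routes_.find(prefix);
        if (it == routes_.end()) {
            return;
        }
        for (const auto& ip : it->second) {
            assert(refCount_[ip] > 0);  // every stored next hop holds a reference
            --refCount_[ip];
        }
        routes_.erase(it);
    }

    bool withFix_;
    std::map<std::string, std::vector<std::string>> routes_;
    std::map<std::string, int> refCount_;
};
```

Feeding the model a default route via an eBGP next hop, followed by an update whose only next hop is on eth0, leaves the entry and a non-zero reference count in place when `withFix` is false, and clears both when it is true.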