Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Workaround] EvpnRemoteVnip2pOrch warmboot check failure #2626

Merged
merged 3 commits into from
Feb 13, 2023

Conversation

jcaiMR
Copy link
Contributor

@jcaiMR jcaiMR commented Jan 20, 2023

What I did
Workaround for issue #12361, let wr_arp function pass first, will use another PR for root cause fix.

The trigger of this issue is warm-restart check failure, and EvpnRemoteVnip2pOrch::addOperation() caused it.
it always return false because of evpn_orch->getEVPNVtep() always return false cause warmboot check failure.
The issue only happened on Broadcom platform since it support P2P vxlan tunnel, and will run into EvpnRemoteVnip2pOrch.
Nvidia SAI currently only support P2MP tunnel, so no issue on that platform.

Jan 18 11:37:14.701069 str-dx010-acs-5 WARNING swss#orchagent: :- addOperation: Vxlan tunnelPort doesn't exist: 192.168.8.1
Jan 18 11:37:14.701177 str-dx010-acs-5 WARNING swss#orchagent: :- addTunnelUser: Unable to find EVPN VTEP. user=0 remote_vtep=19

Seems vtep instance only created when tunnel creation source is EVPN. If non EVPN case, source vtep not exists and it should return true to pass warmboot check.

VxlanTunnel::VxlanTunnel(string name, IpAddress srcIp, IpAddress dstIp, tunnel_creation_src_t src) 

                :tunnel_name_(name), src_ip_(srcIp), dst_ip_(dstIp), src_creation_(src) 

{ 
   VxlanTunnelOrch* tunnel_orch = gDirectory.get<VxlanTunnelOrch*>(); 
   if (dstIp.isZero()) 
   { 
       tunnel_orch->addVTEP(this, srcIp); 
       vtep_ptr = NULL; 
   } 
   **else if (src_creation_ == TNL_CREATION_SRC_EVPN)**  
   { 
       vtep_ptr = tunnel_orch->getVTEP(srcIp); 
       tunnel_orch->addRemoveStateTableEntry(name,srcIp, dstIp, src, true); 
   } 
} 

So here the fix is let EvpnRemoteVnip2pOrch::addOperation() return true when local VTEP is not exists, the logic same like EvpnRemoteVnip2mpOrch::addOperation().

Why I did it

How I verified it
run arp/test_wr_arp.py on Broadcom platform.
image

Details if related

@jcaiMR jcaiMR requested a review from prsunny as a code owner January 20, 2023 06:59
@prsunny prsunny requested a review from dgsudharsan January 20, 2023 20:18
@prsunny
Copy link
Collaborator

prsunny commented Jan 20, 2023

@srj102 , please review this as it is a blocker and fixing a warmboot issue

@dgsudharsan
Copy link
Collaborator

@jcaiMR @prsunny Do we know why EvpnRemoteVnip2pOrch::addOperation is called in non EVPN scenario. This will generally be called when VXLAN_REMOTE_VNI table is present. Is it populated in non evpn scenario?

@jcaiMR
Copy link
Contributor Author

jcaiMR commented Jan 27, 2023

@jcaiMR @prsunny Do we know why EvpnRemoteVnip2pOrch::addOperation is called in non EVPN scenario. This will generally be called when VXLAN_REMOTE_VNI table is present. Is it populated in non evpn scenario?

Yes, the issue happened in non-evpn scenario, seems in function OrchDaemon::init(), we directly create evpn_remote_vni_orch and push into m_orchList.

 if (vxlan_tunnel_orch->isDipTunnelsSupported())
    {
        EvpnRemoteVnip2pOrch* evpn_remote_vni_orch = new EvpnRemoteVnip2pOrch(m_applDb, APP_VXLAN_REMOTE_VNI_TABLE_NAME);
        gDirectory.set(evpn_remote_vni_orch);
        m_orchList.push_back(evpn_remote_vni_orch);
    }
    else
    {
        EvpnRemoteVnip2mpOrch* evpn_remote_vni_orch = new EvpnRemoteVnip2mpOrch(m_applDb, APP_VXLAN_REMOTE_VNI_TABLE_NAME);
        gDirectory.set(evpn_remote_vni_orch);
        m_orchList.push_back(evpn_remote_vni_orch);
    }

And warmboot check function OrchDaemon::warmRestartCheck( ) checked all orch agents in m_orchList, and finally call addOperation() for each orch agent. I didn't find any check logic for VXLAN_REMOTE_VNI table before walking orch agent list, Do I missed something ?

attached some trace during warmboot:

Thread 1 "orchagent" hit Breakpoint 1, EvpnRemoteVnip2mpOrch::addOperation (this=0x55ecbd1e7af0, request=...) at vxlanorch.cpp:2476
2476    vxlanorch.cpp: No such file or directory.
(gdb) bt
#0  EvpnRemoteVnip2mpOrch::addOperation (this=0x55ecbd1e7af0, request=...) at vxlanorch.cpp:2476
#1  0x000055ecbb8d1c60 in Orch2::doTask (this=0x55ecbd1e7af0, consumer=...) at orch.cpp:1063
#2  0x000055ecbb8d5a9e in Consumer::drain (this=0x55ecbd1e85b0) at orch.cpp:241
#3  Consumer::drain (this=0x55ecbd1e85b0) at orch.cpp:238
#4  Consumer::execute (this=0x55ecbd1e85b0) at orch.cpp:235
#5  0x000055ecbb8c54c9 in OrchDaemon::start (this=this@entry=0x55ecbd0bfd80) at orchdaemon.cpp:755
#6  0x000055ecbb843fe0 in main (argc=<optimized out>, argv=<optimized out>) at main.cpp:735
(gdb) c

@jcaiMR jcaiMR requested a review from vaibhavhd January 30, 2023 10:38
prsunny
prsunny previously approved these changes Jan 31, 2023
@prsunny
Copy link
Collaborator

prsunny commented Jan 31, 2023

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

vaibhavhd
vaibhavhd previously approved these changes Feb 1, 2023
@srj102
Copy link
Contributor

srj102 commented Feb 1, 2023

@jcaiMR @prsunny Do we know why EvpnRemoteVnip2pOrch::addOperation is called in non EVPN scenario. This will generally be called when VXLAN_REMOTE_VNI table is present. Is it populated in non evpn scenario?

Yes, the issue happened in non-evpn scenario, seems in function OrchDaemon::init(), we directly create evpn_remote_vni_orch and push into m_orchList.

 if (vxlan_tunnel_orch->isDipTunnelsSupported())
    {
        EvpnRemoteVnip2pOrch* evpn_remote_vni_orch = new EvpnRemoteVnip2pOrch(m_applDb, APP_VXLAN_REMOTE_VNI_TABLE_NAME);
        gDirectory.set(evpn_remote_vni_orch);
        m_orchList.push_back(evpn_remote_vni_orch);
    }
    else
    {
        EvpnRemoteVnip2mpOrch* evpn_remote_vni_orch = new EvpnRemoteVnip2mpOrch(m_applDb, APP_VXLAN_REMOTE_VNI_TABLE_NAME);
        gDirectory.set(evpn_remote_vni_orch);
        m_orchList.push_back(evpn_remote_vni_orch);
    }

And warmboot check function OrchDaemon::warmRestartCheck( ) checked all orch agents in m_orchList, and finally call addOperation() for each orch agent. I didn't find any check logic for VXLAN_REMOTE_VNI table before walking orch agent list, Do I missed something ?

attached some trace during warmboot:

Thread 1 "orchagent" hit Breakpoint 1, EvpnRemoteVnip2mpOrch::addOperation (this=0x55ecbd1e7af0, request=...) at vxlanorch.cpp:2476
2476    vxlanorch.cpp: No such file or directory.
(gdb) bt
#0  EvpnRemoteVnip2mpOrch::addOperation (this=0x55ecbd1e7af0, request=...) at vxlanorch.cpp:2476
#1  0x000055ecbb8d1c60 in Orch2::doTask (this=0x55ecbd1e7af0, consumer=...) at orch.cpp:1063
#2  0x000055ecbb8d5a9e in Consumer::drain (this=0x55ecbd1e85b0) at orch.cpp:241
#3  Consumer::drain (this=0x55ecbd1e85b0) at orch.cpp:238
#4  Consumer::execute (this=0x55ecbd1e85b0) at orch.cpp:235
#5  0x000055ecbb8c54c9 in OrchDaemon::start (this=this@entry=0x55ecbd0bfd80) at orchdaemon.cpp:755
#6  0x000055ecbb843fe0 in main (argc=<optimized out>, argv=<optimized out>) at main.cpp:735
(gdb) c

Having the p2premotevniorch in the orchList does not explain why there was an event.
Could you dump the contents of the VXLAN_REMOTE_VNI table before the WB command is issued ?
Does it have any entries ?

keys VXLAN_REMOTE_VNI

orchagent/vxlanorch.cpp Show resolved Hide resolved
EvpnNvoOrch* evpn_orch = gDirectory.get<EvpnNvoOrch*>();
auto vtep_ptr = evpn_orch->getEVPNVtep();
if (!vtep_ptr) {
SWSS_LOG_WARN("Remote VNI add: Source VTEP not found. remote=%s vid=%d",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this log say during warm boot ? Does it point to anything valid ?
ie what is the remote ip and what is the vid ?
If it is some spurious data then we need to investigate that as this can be seen on any of the other tables in other OAs.

Quite clearly the null check cannot be the fix especially in a non EVPN scenario.
We would need to see why is this being called when there is no explanation for the presence of an entry in the VXLAN_REMOTE_VNI table.

Copy link
Contributor Author

@jcaiMR jcaiMR Feb 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jcaiMR @prsunny Do we know why EvpnRemoteVnip2pOrch::addOperation is called in non EVPN scenario. This will generally be called when VXLAN_REMOTE_VNI table is present. Is it populated in non evpn scenario?

Yes, the issue happened in non-evpn scenario, seems in function OrchDaemon::init(), we directly create evpn_remote_vni_orch and push into m_orchList.

 if (vxlan_tunnel_orch->isDipTunnelsSupported())
    {
        EvpnRemoteVnip2pOrch* evpn_remote_vni_orch = new EvpnRemoteVnip2pOrch(m_applDb, APP_VXLAN_REMOTE_VNI_TABLE_NAME);
        gDirectory.set(evpn_remote_vni_orch);
        m_orchList.push_back(evpn_remote_vni_orch);
    }
    else
    {
        EvpnRemoteVnip2mpOrch* evpn_remote_vni_orch = new EvpnRemoteVnip2mpOrch(m_applDb, APP_VXLAN_REMOTE_VNI_TABLE_NAME);
        gDirectory.set(evpn_remote_vni_orch);
        m_orchList.push_back(evpn_remote_vni_orch);
    }

And warmboot check function OrchDaemon::warmRestartCheck( ) checked all orch agents in m_orchList, and finally call addOperation() for each orch agent. I didn't find any check logic for VXLAN_REMOTE_VNI table before walking orch agent list, Do I missed something ?
attached some trace during warmboot:

Thread 1 "orchagent" hit Breakpoint 1, EvpnRemoteVnip2mpOrch::addOperation (this=0x55ecbd1e7af0, request=...) at vxlanorch.cpp:2476
2476    vxlanorch.cpp: No such file or directory.
(gdb) bt
#0  EvpnRemoteVnip2mpOrch::addOperation (this=0x55ecbd1e7af0, request=...) at vxlanorch.cpp:2476
#1  0x000055ecbb8d1c60 in Orch2::doTask (this=0x55ecbd1e7af0, consumer=...) at orch.cpp:1063
#2  0x000055ecbb8d5a9e in Consumer::drain (this=0x55ecbd1e85b0) at orch.cpp:241
#3  Consumer::drain (this=0x55ecbd1e85b0) at orch.cpp:238
#4  Consumer::execute (this=0x55ecbd1e85b0) at orch.cpp:235
#5  0x000055ecbb8c54c9 in OrchDaemon::start (this=this@entry=0x55ecbd0bfd80) at orchdaemon.cpp:755
#6  0x000055ecbb843fe0 in main (argc=<optimized out>, argv=<optimized out>) at main.cpp:735
(gdb) c

Having the p2premotevniorch in the orchList does not explain why there was an event. Could you dump the contents of the VXLAN_REMOTE_VNI table before the WB command is issued ? Does it have any entries ?

keys VXLAN_REMOTE_VNI

Thanks for comments. we don't have EVPN configured and "sonic-db-cli APPL_DB keys "VXLAN_REMOTE_VNI*"" is empty.
image

Do you mean if VXLAN_REMOTE_VNI table is not exists, there will no event to trigger EvpnRemoteVnip2pOrch/EvpnRemoteVnip2mpOrch addOperation ? Then here seems we need dig more to see what kind of events from Consumer table right ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes addOperation cannot be called unless the VXLAN_REMOTE_VNI table is populated.

My understanding is that warmRestartCheck will not call addOperation.
It just checks the sizes of the m_toSync map and if it is non zero then returns a false.

So one possible explanation is there was an earlier operation which resulted in the VXLAN_REMOTE_VNI being populated and resulting in a failure that you saw.
Under valid EVPN VXLAN configuration one is not expected to see this failure.

WB operation later on results in the readiness check failing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for suggestion, will have some further debug to find the event trigger.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srj102 I agree with your comments. However, I think we need to include this work-around for the short term. @jcaiMR please leave the issue open after this PR merged. (remove the fix PR comment).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, have modified the title and removed Fix #. Will use other PR for the root cause solution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srj102 are you ok to remove the blocker on this PR?

@jcaiMR jcaiMR changed the title fix EvpnRemoteVnip2pOrch warmboot check failure [Workaround] EvpnRemoteVnip2pOrch warmboot check failure Feb 7, 2023
@jcaiMR jcaiMR dismissed stale reviews from vaibhavhd and prsunny via cbe9897 February 7, 2023 13:48
@@ -2347,6 +2347,14 @@ bool EvpnRemoteVnip2pOrch::addOperation(const Request& request)
return true;
}

EvpnNvoOrch* evpn_orch = gDirectory.get<EvpnNvoOrch*>();
auto vtep_ptr = evpn_orch->getEVPNVtep();
if (!vtep_ptr) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please follow convention. { -> new line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will update in latest diff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

@jcaiMR jcaiMR requested a review from prsunny February 9, 2023 02:03
@prsunny prsunny merged commit 0704f78 into sonic-net:master Feb 13, 2023
yxieca pushed a commit that referenced this pull request Feb 21, 2023
AntonHryshchuk added a commit to AntonHryshchuk/sonic-buildimage that referenced this pull request Feb 22, 2023
Update sonic-swss submodule pointer to include the following:
* f66abed Support for tc-dot1p and tc-dscp qosmap ([sonic-net#2559](sonic-net/sonic-swss#2559))
* 35385ad [RouteOrch] Record ROUTE_TABLE entry programming status to APPL_STATE_DB ([sonic-net#2512](sonic-net/sonic-swss#2512))
* 0704f78 [Workaround] EvpnRemoteVnip2pOrch warmboot check failure ([sonic-net#2626](sonic-net/sonic-swss#2626))
* 4df5cab [ResponsePublisher] add pipeline support  ([sonic-net#2511](sonic-net/sonic-swss#2511))

Signed-off-by: AntonHryshchuk <[email protected]>
dprital added a commit to dprital/sonic-buildimage that referenced this pull request Feb 23, 2023
Update sonic-swss submodule pointer to include the following:
* baa302e Do not allow to add port to .1Q bridge while router port deletion is not completed  ([sonic-net#2669](sonic-net/sonic-swss#2669))
* f66abed Support for tc-dot1p and tc-dscp qosmap ([sonic-net#2559](sonic-net/sonic-swss#2559))
* 35385ad [RouteOrch] Record ROUTE_TABLE entry programming status to APPL_STATE_DB ([sonic-net#2512](sonic-net/sonic-swss#2512))
* 0704f78 [Workaround] EvpnRemoteVnip2pOrch warmboot check failure ([sonic-net#2626](sonic-net/sonic-swss#2626))
* 4df5cab [ResponsePublisher] add pipeline support  ([sonic-net#2511](sonic-net/sonic-swss#2511))

Signed-off-by: dprital <[email protected]>
prsunny pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Feb 23, 2023
Update sonic-swss submodule pointer to include the following:
* baa302e Do not allow to add port to .1Q bridge while router port deletion is not completed  ([#2669](sonic-net/sonic-swss#2669))
* f66abed Support for tc-dot1p and tc-dscp qosmap ([#2559](sonic-net/sonic-swss#2559))
* 35385ad [RouteOrch] Record ROUTE_TABLE entry programming status to APPL_STATE_DB ([#2512](sonic-net/sonic-swss#2512))
* 0704f78 [Workaround] EvpnRemoteVnip2pOrch warmboot check failure ([#2626](sonic-net/sonic-swss#2626))
* 4df5cab [ResponsePublisher] add pipeline support  ([#2511](sonic-net/sonic-swss#2511))
StormLiangMS pushed a commit that referenced this pull request Mar 7, 2023
prsunny pushed a commit that referenced this pull request May 11, 2023
…)" (#2773)

This reverts commit 750e064.
Reverts the PR #2756

The fix added breaks the previously added workaround #2626. Hence requesting to revert the fix.
Once we find a proper solution for sonic-net/sonic-buildimage#12361 we need to reintegrate this PR
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants