-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix MUX toggling issue #1
Conversation
…iving the config notification
It appears that Tamer set up some pipelines before, I will check. |
can you add unit test code? |
the description is unclear, please clearly describe the problem and how you did to address it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need description and unit test code
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
@zjswhhh the change looks good with some code structure change suggestions. I agree with Guohan, please add unit test. |
@zjswhhh lets also run this change with all DualToR sonic-mgmt tests and see if they pass. |
Added accordingly. |
Will do. |
@vdahiya12 Confirmed that this change wouldn't break any dualtor test that passed on 20201231 release versions. There are 4 test cases fail consistently even on release versions, we can take a look offline if needed. |
|
when you merge this pr, please use squash merge and also make sure the commit message is properly formatted and have all the information in your pr description. |
Description of PR
Summary:
Fixes # (issue)
Fix the issue that randomly fails MUX state toggling.
To be more specific, the issue was when trying to toggle MUX ports from standby to active utilizing cli "config mux mode standby all", some random ports won't be toggled successfully and will remain standby mode. This issue can be easily reproduced when ICMP_responder is off but not when the ports are in healthy state.
The cause is when ICMP_responder is off and ToRs stay in unhealthy state, linkmgrd will probe mux state every 300ms1 which will make mux state enter "wait". While mux state equals "wait", linkmgrd won't switch mux sate to user's desiring state2.
A state machine diagram for this particular scenario can be found under the internal sharepoint folder for Gemini design doc.
Signed-off-by: Jing Zhang [email protected]
Type of change
Back port request
Approach
What is the motivation for this PR?
We want to make sure users are able to toggle from standby to active even when icmp responder is not working, without manual retries.
How did you do it?
Cache the MUX mode configuration change, execute the change once MUX exits "wait" state.
How did you verify/test it?
Tested it on virtual dual testbeds, unable to reproduce the issue anymore.
Footnotes
This interval will be multiplied by a backoff factor if the returned state is unknown. ↩
Instead, it will do a mux state probe. ↩