Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix MUX toggling issue #1

Merged
merged 2 commits into from
Nov 19, 2021
Merged

Conversation

zjswhhh
Copy link
Contributor

@zjswhhh zjswhhh commented Nov 13, 2021

Description of PR

Summary:
Fixes # (issue)

Fix the issue that randomly fails MUX state toggling.

To be more specific, the issue was when trying to toggle MUX ports from standby to active utilizing cli "config mux mode standby all", some random ports won't be toggled successfully and will remain standby mode. This issue can be easily reproduced when ICMP_responder is off but not when the ports are in healthy state.

The cause is when ICMP_responder is off and ToRs stay in unhealthy state, linkmgrd will probe mux state every 300ms1 which will make mux state enter "wait". While mux state equals "wait", linkmgrd won't switch mux sate to user's desiring state2.

A state machine diagram for this particular scenario can be found under the internal sharepoint folder for Gemini design doc.

Signed-off-by: Jing Zhang [email protected]

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 201911

Approach

What is the motivation for this PR?

We want to make sure users are able to toggle from standby to active even when icmp responder is not working, without manual retries.

How did you do it?

Cache the MUX mode configuration change, execute the change once MUX exits "wait" state.

How did you verify/test it?

Tested it on virtual dual testbeds, unable to reproduce the issue anymore.

Footnotes

  1. This interval will be multiplied by a backoff factor if the returned state is unknown.

  2. Instead, it will do a mux state probe.

@zjswhhh
Copy link
Contributor Author

zjswhhh commented Nov 13, 2021

It appears that Tamer set up some pipelines before, I will check.

@zjswhhh zjswhhh closed this Nov 13, 2021
@zjswhhh zjswhhh reopened this Nov 13, 2021
@lguohan
Copy link
Contributor

lguohan commented Nov 13, 2021

can you add unit test code?

@lguohan
Copy link
Contributor

lguohan commented Nov 13, 2021

the description is unclear, please clearly describe the problem and how you did to address it.

Copy link
Contributor

@lguohan lguohan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need description and unit test code

@lguohan
Copy link
Contributor

lguohan commented Nov 13, 2021

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@yxieca
Copy link
Contributor

yxieca commented Nov 15, 2021

@zjswhhh the change looks good with some code structure change suggestions. I agree with Guohan, please add unit test.

@vdahiya12
Copy link

@zjswhhh lets also run this change with all DualToR sonic-mgmt tests and see if they pass.

@zjswhhh
Copy link
Contributor Author

zjswhhh commented Nov 17, 2021

need description and unit test code

Added accordingly.

@zjswhhh
Copy link
Contributor Author

zjswhhh commented Nov 17, 2021

@zjswhhh lets also run this change with all DualToR sonic-mgmt tests and see if they pass.

Will do.

@zjswhhh
Copy link
Contributor Author

zjswhhh commented Nov 18, 2021

@zjswhhh lets also run this change with all DualToR sonic-mgmt tests and see if they pass.

Will do.

@vdahiya12 Confirmed that this change wouldn't break any dualtor test that passed on 20201231 release versions. There are 4 test cases fail consistently even on release versions, we can take a look offline if needed.

@lguohan lguohan closed this Nov 18, 2021
@lguohan lguohan reopened this Nov 18, 2021
@vdahiya12
Copy link

@zjswhhh lets also run this change with all DualToR sonic-mgmt tests and see if they pass.

Will do.

@vdahiya12 Confirmed that this change wouldn't break any dualtor test that passed on 20201231 release versions. There are 4 test cases fail consistently even on release versions, we can take a look offline if needed.
Thanks, did we run both dualtor and dualtor_io ?

@lguohan
Copy link
Contributor

lguohan commented Nov 18, 2021

when you merge this pr, please use squash merge and also make sure the commit message is properly formatted and have all the information in your pr description.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants