Add UT for orchagent watchdog #8306
The pre-commit check detected issues in the files touched by this pull request. Detailed pre-commit check results: To run the pre-commit checks locally, you can follow below steps:
tests/system_health/test_watchdog.py
Outdated
duthost = duthosts[enum_rand_one_per_hwsku_hostname]

result = duthost.shell(
    r"docker exec -i swss sh -c 'test -f /usr/bin/supervisor-proc-watchdog-listener && echo exist'",
@liuh-80 can you point me to the watchdog listener code?
@prgeor here is the listener: sonic-net/sonic-buildimage#14686
I will update the PR later to merge the listener with the process exit listener.
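The existence check quoted in this thread can be sketched as a small helper. This is only an illustration: `run_shell` is a hypothetical callable standing in for `duthost.shell`, and the assumption that it returns a dict with a `"stdout"` key is mine, not something stated in this PR.

```python
# Path taken from the snippet under review.
LISTENER_PATH = "/usr/bin/supervisor-proc-watchdog-listener"


def build_check_command(container="swss", path=LISTENER_PATH):
    """Build the shell command that prints 'exist' when the listener file is present."""
    return "docker exec -i {} sh -c 'test -f {} && echo exist'".format(container, path)


def listener_installed(run_shell):
    """Return True if the watchdog listener exists inside the container.

    run_shell is a hypothetical stand-in for duthost.shell; it is assumed
    to return a dict that carries the command output under "stdout".
    """
    result = run_shell(build_check_command())
    return "exist" in result.get("stdout", "")
```

A test could call `pytest.skip(...)` when `listener_installed(...)` returns False, so that images without the listener do not fail the watchdog test.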
**What I did**
Improve orchagent: output a heartbeat message to systemd.

**Why I did it**
Currently the SONiC monit system only checks whether the orchagent process exists. If the orchagent process gets stuck and stops processing, monit cannot detect or report it.

**How I verified it**
Passed all UT. Manually validated that the heartbeat message works correctly.

**Details if related**
Another in-progress PR will add a watchdog for this heartbeat message: sonic-net/sonic-buildimage#14686
sonic-mgmt UT PR: sonic-net/sonic-mgmt#8306
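The gap described above (a process that still exists but has stopped working) is what a heartbeat timeout catches. The following is a minimal sketch of that idea only; the class name `HeartbeatWatchdog`, its methods, and the injected clock are hypothetical and are not taken from the orchagent or listener code. The 60-second default mirrors the alert delay mentioned in the review discussion.

```python
import time


class HeartbeatWatchdog:
    """Track the last heartbeat per process and report the ones that went silent."""

    def __init__(self, timeout=60.0, clock=time.monotonic):
        self.timeout = timeout  # seconds of silence before a process counts as stuck
        self.clock = clock      # injectable so the logic can be tested without sleeping
        self.last_seen = {}

    def heartbeat(self, name):
        """Record that the named process just reported progress."""
        self.last_seen[name] = self.clock()

    def stuck(self):
        """Return the sorted names whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

A process that is paused or deadlocked keeps its PID, so an existence check still passes, while `stuck()` starts reporting it once the timeout elapses.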
…ave issue. (#14686)
This PR depends on sonic-net/sonic-swss#2737 being merged first.

**What I did**
Add an orchagent watchdog to monitor and alert on orchagent stuck issues.

**Why I did it**
Currently the SONiC monit system only checks whether the orchagent process exists. If the orchagent process gets stuck and stops processing, monit cannot detect or report it.

**How I verified it**
Passed all UT. Added new UT sonic-net/sonic-mgmt#8306 to check that the watchdog works correctly.
Manual test: after pausing orchagent with 'kill -STOP <pid>', check that a warning message like the following appears in the log:
Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).

**Details if related**
Heartbeat message PR: sonic-net/sonic-swss#2737
UT PR: sonic-net/sonic-mgmt#8306
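A test that looks for this alert needs to recognize the log line. Below is a minimal sketch of one way to do that; the regular expression is inferred from the single sample line quoted above, and the helper name `parse_stuck_alert` is hypothetical. The real listener may format other cases differently.

```python
import re

# Pattern inferred from the one sample log line in the commit message.
ALERT_RE = re.compile(
    r"Process '(?P<process>[^']+)' is stuck in namespace "
    r"'(?P<namespace>[^']*)' \((?P<minutes>\d+(?:\.\d+)?) minutes\)"
)


def parse_stuck_alert(line):
    """Return (process, namespace, minutes) for an alert line, or None if it does not match."""
    match = ALERT_RE.search(line)
    if match is None:
        return None
    return match.group("process"), match.group("namespace"), float(match.group("minutes"))
```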
pytest.skip("Skip orchagent watchdog test.")

# wait watchdog emit alert
WATCHDOG_TIMEOUT = 120
The watchdog sends an alert after 60 seconds, so I wait for 60*2 seconds here.
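The wait described here can be written as a polling loop with a deadline rather than a single fixed sleep. This is a sketch under assumptions: `alert_seen` is a hypothetical callable that returns True once the alert is found in the log, the 5-second interval is arbitrary, and the clock and sleep functions are injectable only so the loop can be exercised without real delays.

```python
import time

WATCHDOG_TIMEOUT = 120  # seconds; twice the 60-second alert delay


def wait_for_alert(alert_seen, timeout=WATCHDOG_TIMEOUT, interval=5,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll alert_seen() until it returns True or the timeout expires.

    Returns True as soon as the alert is observed and False if the
    deadline passes first.
    """
    deadline = clock() + timeout
    while True:
        if alert_seen():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Polling lets the test finish shortly after the alert shows up instead of always waiting the full 120 seconds.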
tests/system_health/test_watchdog.py
Outdated
def pause_orchagent(duthost):
    # find orchagent pid
    pid = duthost.shell(
        r"ps -ef | grep orchagent | grep -v grep | awk '{print $2}'",
Fixed.
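For reference, the PID lookup and pause step from the quoted snippet could be sketched as below. This is not the code that was merged: `run_shell` is a hypothetical stand-in for `duthost.shell` (assumed to return a dict with a `"stdout"` key), and the check that exactly one PID was found is an added safeguard of mine.

```python
# Lookup pipeline copied from the snippet under review.
FIND_PID_CMD = r"ps -ef | grep orchagent | grep -v grep | awk '{print $2}'"


def parse_pids(output):
    """Convert whitespace-separated PID text into a list of ints, ignoring junk."""
    return [int(token) for token in output.split() if token.isdigit()]


def pause_command(pid):
    """SIGSTOP suspends the process without terminating it."""
    return "kill -STOP {}".format(pid)


def resume_command(pid):
    """SIGCONT resumes a process that was suspended with SIGSTOP."""
    return "kill -CONT {}".format(pid)


def pause_orchagent(run_shell):
    """Find the single orchagent PID, pause it, and return the PID."""
    pids = parse_pids(run_shell(FIND_PID_CMD)["stdout"])
    if len(pids) != 1:
        raise RuntimeError("expected exactly one orchagent PID, got {}".format(pids))
    run_shell(pause_command(pids[0]))
    return pids[0]
```

Returning the PID lets the caller issue the matching `resume_command(pid)` during cleanup.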
…ave issue. (#15429)
Add watchdog mechanism to swss service and generate alert when swss have issue.

**Work item tracking**
Microsoft ADO (number only): 16578912

**What I did**
Add an orchagent watchdog to monitor and alert on orchagent stuck issues.

**Why I did it**
Currently the SONiC monit system only checks whether the orchagent process exists. If the orchagent process gets stuck and stops processing, monit cannot detect or report it.

**How I verified it**
Passed all UT. Manually verified that process_monitoring/test_critical_process_monitoring.py passes. Added new UT sonic-net/sonic-mgmt#8306 to check that the watchdog works correctly.
Manual test: after pausing orchagent with 'kill -STOP <pid>', check that a warning message like the following appears in the log:
Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).

**Details if related**
Heartbeat message PR: sonic-net/sonic-swss#2737
UT PR: sonic-net/sonic-mgmt#8306
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
system_health/test_watchdog.py::test_orchagent_watchdog[vlab-01] PASSED [100%]
Will merge PR.
### Description of PR
Add UT for orchagent watchdog.

Summary:
The SWSS service will add a watchdog mechanism that generates keepalive messages and raises an alert when swss has an issue. This PR adds a new UT to cover the watchdog mechanism.

### Type of change
- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [x] Test case(new/improvement)

### Back port request
- [ ] 201911
- [ ] 202012
- [ ] 202205

### Approach
#### What is the motivation for this PR?
Add a new UT to test the watchdog mechanism and protect it against regressions from code changes.

#### How did you do it?
Pause orchagent with the 'kill -STOP' command and check whether the watchdog sends an alert.

#### How did you verify/test it?
Manually tested the new UT. Passed PR validation.

#### Any platform specific information?
No

#### Supported testbed topology if it's a new test case?
Any

### Documentation
<!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? -->
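The approach in this description (pause, then check for the alert) can be sketched as a small flow that always resumes the process, even when the check fails or raises. The function name `run_watchdog_check` and its three callables are hypothetical and are not part of the test added by this PR.

```python
def run_watchdog_check(pause, resume, wait_for_alert):
    """Pause the process, wait for the watchdog alert, and always resume afterwards.

    pause, resume and wait_for_alert are caller-supplied callables;
    wait_for_alert should return True when the alert was observed.
    """
    pause()
    try:
        return wait_for_alert()
    finally:
        # Resume unconditionally so a failed check does not leave the
        # device with a suspended orchagent.
        resume()
```

Putting the resume step in `finally` keeps a failing assertion or timeout from leaving orchagent suspended for later tests.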