Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xcvrd SfpStateUpdateTask takes long time to shutdown than before #13591

Closed
Junchao-Mellanox opened this issue Feb 2, 2023 · 5 comments
Closed
Assignees
Labels
Triaged this issue has been triaged

Comments

@Junchao-Mellanox
Copy link
Collaborator

Description

The issue is introduced by commit sonic-net/sonic-platform-daemons@153ea47. In this commit, it changed SfpStateUpdateTask from Process to Thread. In order to make shutdown flow fast, I saw the code raises an exception in the shutdown flow. However, raising an exception does not make shutdown flow fast due my test on Nvidia platform.

Steps to reproduce the issue:

  1. Based on your change, change line from
status, port_dict, error_dict = _wrapper_get_transceiver_change_event(timeout)

to

status, port_dict, error_dict = _wrapper_get_transceiver_change_event(60000)
  1. Stop xcvrd by command “supervisorctl stop xcvrd”

Describe the results you received:

I saw xcvrd took 10 seconds to shutdown which means that is was terminated by SIG_KILL.

root@r-leopard-41:/# time supervisorctl stop xcvrd                                                                                                                                                                                          
xcvrd: stopped                                                                                                                                                                                                                               
                                                                                                                                                                                                                                             
real    0m10.292s                                                                                                                                                                                                                            
user    0m0.205s                                                                                                                                                                                                                             
sys     0m0.013s 

Describe the results you expected:

From code I see that there is chance select timeout will be set to 60000.

elif next_state == STATE_NORMAL:
                timeout = STATE_MACHINE_UPDATE_PERIOD_MSECS

xcvrd should shutdown as fast as before

Output of show version:

(paste your output here)

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

@Junchao-Mellanox
Copy link
Collaborator Author

Hi @mihirpat1, please check.

@prgeor prgeor added the Triaged this issue has been triaged label Feb 2, 2023
@prgeor
Copy link
Contributor

prgeor commented Feb 2, 2023

Issues is seen currently only on mellanox platform.

@Junchao-Mellanox
Copy link
Collaborator Author

xcvrd is a common sonic component shared by all vendors. Changes on it should work on all platforms.

BTW, I checked Arista platform API implementation. Arista platform did not run into the issue because the timeout value will be no more than 1 second in Arista platform. Please check this code from platform/barefoot/sonic-platform-modules-arista/arista/utils/sonic_platform/event.py:

   def wait(self, timeout=0):
      res = self.get_empty_results()
      block = (timeout == 0)
      detected = False

      while not detected and (timeout > 0 or block):
         begin = time.time()
         interval = self.pollInterval if block else min(timeout, self.pollInterval)

See the last line, the epoll timeout value will never be larger than self.pollInterval. And pollInterval is default 1sec:

class EventWatcher(object):
   def __init__(self, preserve=False, pollInterval=1000.):

So, I would suggest you run another test on Arista: change pollInterval from 1000 to 60000.

@prgeor
Copy link
Contributor

prgeor commented Feb 3, 2023

@Junchao-Mellanox as discussed can you break down the sleep in your platform code to polling every < 1sec? That should fix this issue.

@Junchao-Mellanox
Copy link
Collaborator Author

PR created: #13611

liat-grozovik pushed a commit that referenced this issue Feb 14, 2023
… shutdown (#13611)

- Why I did it
Commit sonic-net/sonic-platform-daemons@153ea47 changed SfpStateUpdateTask from Process to Thread. In this commit, it raises an exception in SfpStateUpdateTask to make shutdown flow fast. But it does not work on Nvidia platform as Nvidia platform is passing timeout parameter of get_change_event to select. Linux select function can not be interrupted by a Python exception. There is no such issue on Nvidia platform before that commit. However, in order to comply with the commit and make shutdown flow fast, we decided to change Nvidia platform API implementation.

To fix issue #13591.

- How I did it
The select call in get_change_event should use no more than 1 second as timeout parameter.
Outside the select call, add a while loop to make sure timeout parameter of get_change_event work as expected

- How to verify it
Manual test
StormLiangMS pushed a commit to StormLiangMS/sonic-buildimage that referenced this issue Mar 28, 2023
Related work items: sonic-net#276, sonic-net#305, sonic-net#332, sonic-net#338, sonic-net#339, sonic-net#1188, sonic-net#1192, sonic-net#1197, sonic-net#1206, sonic-net#1685, sonic-net#1690, sonic-net#1696, sonic-net#1699, sonic-net#1709, sonic-net#1727, sonic-net#1737, sonic-net#1741, sonic-net#1742, sonic-net#2511, sonic-net#2512, sonic-net#2532, sonic-net#2559, sonic-net#2626, sonic-net#2638, sonic-net#2645, sonic-net#2649, sonic-net#2660, sonic-net#2669, sonic-net#2670, sonic-net#2678, sonic-net#10084, sonic-net#11442, sonic-net#11873, sonic-net#12047, sonic-net#12110, sonic-net#12207, sonic-net#12529, sonic-net#12678, sonic-net#13235, sonic-net#13287, sonic-net#13372, sonic-net#13395, sonic-net#13456, sonic-net#13497, sonic-net#13522, sonic-net#13545, sonic-net#13547, sonic-net#13552, sonic-net#13569, sonic-net#13572, sonic-net#13578, sonic-net#13591, sonic-net#13611, sonic-net#13647, sonic-net#13649, sonic-net#13660, sonic-net#13710, sonic-net#13716, sonic-net#13724, sonic-net#13726, sonic-net#13732, sonic-net#13735, sonic-net#13739, sonic-net#13757, sonic-net#13786, sonic-net#13792, sonic-net#13800, sonic-net#13801, sonic-net#13802, sonic-net#13805, sonic-net#13806, sonic-net#13812, sonic-net#13814, sonic-net#13822, sonic-net#13831, sonic-net#13834, sonic-net#13847, sonic-net#13870, sonic-net#13882, sonic-net#13884, sonic-net#13885, sonic-net#13894, sonic-net#13895, sonic-net#13926, sonic-net#13932, sonic-net#13935, sonic-net#13942, sonic-net#13951, sonic-net#13953, sonic-net#13964
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this issue Sep 5, 2023
… shutdown (sonic-net#13611)

- Why I did it
Commit sonic-net/sonic-platform-daemons@153ea47 changed SfpStateUpdateTask from Process to Thread. In this commit, it raises an exception in SfpStateUpdateTask to make shutdown flow fast. But it does not work on Nvidia platform as Nvidia platform is passing timeout parameter of get_change_event to select. Linux select function can not be interrupted by a Python exception. There is no such issue on Nvidia platform before that commit. However, in order to comply with the commit and make shutdown flow fast, we decided to change Nvidia platform API implementation.

To fix issue sonic-net#13591.

- How I did it
The select call in get_change_event should use no more than 1 second as timeout parameter.
Outside the select call, add a while loop to make sure timeout parameter of get_change_event work as expected

- How to verify it
Manual test
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this issue Sep 5, 2023
… shutdown (sonic-net#13611)

- Why I did it
Commit sonic-net/sonic-platform-daemons@153ea47 changed SfpStateUpdateTask from Process to Thread. In this commit, it raises an exception in SfpStateUpdateTask to make shutdown flow fast. But it does not work on Nvidia platform as Nvidia platform is passing timeout parameter of get_change_event to select. Linux select function can not be interrupted by a Python exception. There is no such issue on Nvidia platform before that commit. However, in order to comply with the commit and make shutdown flow fast, we decided to change Nvidia platform API implementation.

To fix issue sonic-net#13591.

- How I did it
The select call in get_change_event should use no more than 1 second as timeout parameter.
Outside the select call, add a while loop to make sure timeout parameter of get_change_event work as expected

- How to verify it
Manual test
yxieca pushed a commit that referenced this issue Sep 6, 2023
… shutdown (#13611) (#16449)

- Why I did it
Commit sonic-net/sonic-platform-daemons@153ea47 changed SfpStateUpdateTask from Process to Thread. In this commit, it raises an exception in SfpStateUpdateTask to make shutdown flow fast. But it does not work on Nvidia platform as Nvidia platform is passing timeout parameter of get_change_event to select. Linux select function can not be interrupted by a Python exception. There is no such issue on Nvidia platform before that commit. However, in order to comply with the commit and make shutdown flow fast, we decided to change Nvidia platform API implementation.

To fix issue #13591.

- How I did it
The select call in get_change_event should use no more than 1 second as timeout parameter.
Outside the select call, add a while loop to make sure timeout parameter of get_change_event work as expected

- How to verify it
Manual test

Co-authored-by: Junchao-Mellanox <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Triaged this issue has been triaged
Projects
None yet
Development

No branches or pull requests

3 participants