swss#supervisor-proc-exit-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes) is seen sometimes during switch initialization #17530

Closed
dgsudharsan opened this issue Dec 16, 2023 · 7 comments
Labels: Issue for 202305, MSFT, Triaged (this issue has been triaged)

@dgsudharsan
Collaborator

Description

The following error log appears occasionally during switch initialization:

Dec 14 04:12:21.433385 arc-switch1025 ERR swss#supervisor-proc-exit-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).

On analysis, the logs give no further detail on what orchagent was doing at the time. It may be that, while other processes were initializing, orchagent did not get enough CPU cycles to send its heartbeat. Here are the logs:

Dec 14 04:11:29.907268 arc-switch1025 NOTICE swss#orchagent: :- addNextHopGroup: Create next hop group fc00::22@PortChannel1014,fc00::2a@PortChannel1017
Dec 14 04:12:21.433385 arc-switch1025 ERR swss#supervisor-proc-exit-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).
Dec 14 04:12:27.465717 arc-switch1025 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000022 status:1
Dec 14 04:12:27.465927 arc-switch1025 NOTICE swss#orchagent: :- updatePortOperStatus: Port Ethernet50 oper state set from down to up
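
To quantify how long orchagent itself was silent around the alert, the gap between its consecutive syslog entries can be computed. The following is a minimal sketch, not part of any SONiC tooling: the helper names `parse_syslog_time` and `orchagent_gaps` are hypothetical, and the year is passed in as an assumption because the syslog timestamp does not carry one.

```python
from datetime import datetime

# Syslog lines copied from the excerpt above.
SYSLOG_LINES = [
    "Dec 14 04:11:29.907268 arc-switch1025 NOTICE swss#orchagent: :- addNextHopGroup: Create next hop group fc00::22@PortChannel1014,fc00::2a@PortChannel1017",
    "Dec 14 04:12:21.433385 arc-switch1025 ERR swss#supervisor-proc-exit-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).",
    "Dec 14 04:12:27.465717 arc-switch1025 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000022 status:1",
    "Dec 14 04:12:27.465927 arc-switch1025 NOTICE swss#orchagent: :- updatePortOperStatus: Port Ethernet50 oper state set from down to up",
]


def parse_syslog_time(line, year=2023):
    """Parse the leading 'Mon DD HH:MM:SS.ffffff' stamp (year is an assumption)."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S.%f")


def orchagent_gaps(lines):
    """Return the seconds between consecutive entries logged by orchagent itself."""
    # Match the log tag, so the listener's line mentioning 'orchagent' is skipped.
    times = [parse_syslog_time(line) for line in lines if "swss#orchagent:" in line]
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


gaps = orchagent_gaps(SYSLOG_LINES)
print(f"longest orchagent silence: {max(gaps):.1f} s")
```

On the lines quoted here, the longest silence is a little under 58 seconds, and the alert falls inside that window.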

From the swss rec:

2023-12-14.02:11:27.399122|ROUTE_TABLE:20c0:ca50:0:80::/64|SET|protocol:bgp|nexthop:fc00::22|ifname:PortChannel1014
2023-12-14.02:11:29.507746|ROUTE_TABLE:20c0:b1c0::/64|SET|protocol:bgp|nexthop:fc00::22,fc00::2a|ifname:PortChannel1014,PortChannel1017
2023-12-14.02:12:27.478798|LAG_TABLE:PortChannel1020|SET|admin_status:up|oper_status:up|mtu:9100

From the sairedis rec, it appears that orchagent was busy for about 40 seconds installing routes and setting values, but there are no entries for the last 20 seconds:

2023-12-14.02:11:29.922649|C|SAI_OBJECT_TYPE_NEXT_HOP_GROUP_MEMBER||oid:0x2d000000000a5b|SAI_NEXT_HOP_GROUP_MEMBER_ATTR_NEXT_HOP_GROUP_ID=oid:0x5000000000a5a|SAI_NEXT_HOP_GROUP_MEMBER_ATTR_NEXT_HOP_ID=oid:0x4000000000a55|SAI_NEXT_HOP_GROUP_MEMBER_ATTR_SEQUENCE_ID=1||oid:0x2d000000000a5c|SAI_NEXT_HOP_GROUP_MEMBER_ATTR_NEXT_HOP_GROUP_ID=oid:0x5000000000a5a|SAI_NEXT_HOP_GROUP_MEMBER_ATTR_NEXT_HOP_ID=oid:0x4000000000a56|SAI_NEXT_HOP_GROUP_MEMBER_ATTR_SEQUENCE_ID=2
2023-12-14.02:11:30.110648|C|SAI_OBJECT_TYPE_ROUTE_ENTRY||{"dest":"fc00::8/126","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000002"}|S
2023-12-14.02:11:32.449287|C|SAI_OBJECT_TYPE_ROUTE_ENTRY||{"dest":"20c0:b010:0:80::/64","switch_id":"oid:0x21000000000000","vr":"oid:0x300000000
2023-12-14.02:12:03.309022|C|SAI_OBJECT_TYPE_ROUTE_ENTRY||{"dest":"192.240.176.128/25","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000
2023-12-14.02:12:05.749086|S|SAI_OBJECT_TYPE_ROUTE_ENTRY||{"dest":"200.0.1.0/26","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000002"}|
2023-12-14.02:12:27.466635|s|SAI_OBJECT_TYPE_HOSTIF:oid:0xd00000000091a|SAI_HOSTIF_ATTR_OPER_STATUS=true
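
The "about 40 seconds busy, 20 seconds silent" reading can be checked against the timestamps quoted above. This is a rough sketch under stated assumptions: the helpers `rec_time` and `busy_and_silent` are hypothetical, only the timestamp, operation, and object type of each quoted line are kept, and the excerpt may be abridged, so gaps between quoted lines are not necessarily gaps in the full recording. The recording timestamps also appear to be two hours behind the syslog timestamps.

```python
from datetime import datetime

# Timestamp, operation and object type of each sairedis rec line quoted above;
# the attribute payload after the object type is dropped here for brevity.
REC_LINES = [
    "2023-12-14.02:11:29.922649|C|SAI_OBJECT_TYPE_NEXT_HOP_GROUP_MEMBER",
    "2023-12-14.02:11:30.110648|C|SAI_OBJECT_TYPE_ROUTE_ENTRY",
    "2023-12-14.02:11:32.449287|C|SAI_OBJECT_TYPE_ROUTE_ENTRY",
    "2023-12-14.02:12:03.309022|C|SAI_OBJECT_TYPE_ROUTE_ENTRY",
    "2023-12-14.02:12:05.749086|S|SAI_OBJECT_TYPE_ROUTE_ENTRY",
    "2023-12-14.02:12:27.466635|s|SAI_OBJECT_TYPE_HOSTIF",
]


def rec_time(line):
    """Parse the 'YYYY-MM-DD.HH:MM:SS.ffffff' field that starts a rec line."""
    return datetime.strptime(line.split("|", 1)[0], "%Y-%m-%d.%H:%M:%S.%f")


def busy_and_silent(lines):
    """Split the window into (first entry .. second-to-last) and the final gap."""
    times = [rec_time(line) for line in lines]
    busy = (times[-2] - times[0]).total_seconds()
    silent = (times[-1] - times[-2]).total_seconds()
    return busy, silent


busy_s, silent_s = busy_and_silent(REC_LINES)
print(f"busy for {busy_s:.1f} s, then silent for {silent_s:.1f} s")
```

On the quoted lines this gives roughly 36 seconds of SAI activity followed by roughly 22 seconds with no entries, which is consistent with the estimate above.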

Steps to reproduce the issue:

  1. Deploy the switch.
  2. Observe that the error log appears occasionally.

Describe the results you received:

The error log is seen.

Describe the results you expected:

No error log should be seen

Additional information you deem important (e.g. issue happens only occasionally):

sonic_dump_arc-switch1025_20231214_053630.tar.gz

@dgsudharsan
Collaborator Author

@liuh-80 Can you please investigate this?

@liuh-80
Contributor

liuh-80 commented Dec 23, 2023

It is difficult to tell whether orchagent is stuck because it is busy or because a code issue has hung it, since in both cases orchagent does not send a heartbeat message within 1 minute.

For example, if a code bug in the SAI API makes a route create request take more than 10 minutes, do we want the watchdog to report this or not?

I will create a PR to increase the watchdog threshold for this issue.
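
The trade-off behind a larger threshold can be illustrated with a toy model. This is only a sketch, not the SONiC watchdog: `HeartbeatWatchdog`, `first_alert_time`, the 180-second threshold, and the 66-second busy period are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatWatchdog:
    """Toy watchdog: reports 'stuck' once the silence reaches threshold_s."""
    threshold_s: float
    last_heartbeat_s: float = 0.0

    def heartbeat(self, now_s: float) -> None:
        self.last_heartbeat_s = now_s

    def is_stuck(self, now_s: float) -> bool:
        return now_s - self.last_heartbeat_s >= self.threshold_s


def first_alert_time(threshold_s, silence_s, poll_s=1.0):
    """Poll during a silence of silence_s seconds; return the first alert time or None."""
    watchdog = HeartbeatWatchdog(threshold_s)
    watchdog.heartbeat(0.0)  # last heartbeat before the process goes quiet
    now_s = 0.0
    while now_s <= silence_s:
        if watchdog.is_stuck(now_s):
            return now_s
        now_s += poll_s
    return None  # heartbeats resumed before the threshold was reached


# A busy period slightly over a minute versus a 10-minute hang.
for threshold in (60.0, 180.0):
    print(threshold, first_alert_time(threshold, 66.0), first_alert_time(threshold, 600.0))
```

In this model a longer threshold hides the transient busy period, while a genuine 10-minute hang is still reported, only later. The watchdog alone still cannot tell the two cases apart.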

@dgsudharsan
Collaborator Author

@liuh-80 Can you please provide ETA for fix?

@liuh-80
Contributor

liuh-80 commented Jan 4, 2024

@liuh-80 Can you please provide ETA for fix?

The ETA is 2024/01/30. I will discuss with Qi whether this needs a fix, because the transient alert does not break anything.

@liuh-80
Contributor

liuh-80 commented Jan 22, 2024

Will change the log from an ERR message to a Warning message.

@liuh-80
Contributor

liuh-80 commented Jan 22, 2024

Draft PR created: #17872

@liuh-80
Contributor

liuh-80 commented Jan 26, 2024

The fix PR is merged; closing this issue.

liuh-80 closed this as completed Jan 26, 2024