link down issue on Arista SONIC T2 in staging when switching from 202205 to 202405 #120

Open
wenyiz2021 opened this issue Feb 10, 2025 · 4 comments

Comments

@wenyiz2021

wenyiz2021 commented Feb 10, 2025

So far, we have noticed that the link-down alert fires when we upgrade the Arista T2 from the SONiC 202205 image to the 202405 image.

The alert arises from two issues:

  1. When bumping the image from 202205 to 202405, swss0/1 and syncd0/1 went down. This happened in staging only; we did not see it in the lab when bumping the image. It is also not a platform-specific issue. It could be related to the image, but we are not sure why it happens only in staging.
  2. On the Arista T2 running the 202405 image, after we restart swss0/1 we see syncd interrupts on all LCs, and this is potentially causing the CRC link errors on its peer T1s (the interrupt syslogs below are the suspected culprit, since their timestamps match the alert creation time):

<30>2025-01-30T21:14:59.666022+00:00 STG01-0101-0400-01T2-sup00 INFO systemd[1]: Started [email protected] - switch state service.

<13>2025-01-30T21:15:01.666281+00:00 STG01-0101-0400-01T2-sup00 NOTICE root: Started swss0 service...

<14>2025-01-30T21:15:22.349038+00:00 STG01-0101-0400-01T2-lc05 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015

<10>2025-01-30T21:15:22.349099+00:00 STG01-0101-0400-01T2-lc05 CRIT syncd#syncd: [none] SAI_API_SWITCH:_brcm_sai_switch_event_cb:950 0x5a1100 Received unhandled switch event - Device Interrupt(13) on unit 0: 0x29a 0x0 0x0

<14>2025-01-30T21:15:22.749727+00:00 STG01-0101-0400-01T2-lc05 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015

<14>2025-01-30T21:15:22.840429+00:00 STG01-0101-0400-01T2-lc05 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015

<14>2025-01-30T21:15:23.122929+00:00 STG01-0101-0400-01T2-lc04 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015

<10>2025-01-30T21:15:23.123224+00:00 STG01-0101-0400-01T2-lc04 CRIT syncd#syncd: [none] SAI_API_SWITCH:_brcm_sai_switch_event_cb:950 0x5a1100 Received unhandled switch event - Device Interrupt(13) on unit 0: 0x29a 0x0 0x0

<14>2025-01-30T21:15:23.173596+00:00 STG01-0101-0400-01T2-lc04 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015

<14>2025-01-30T21:15:23.305868+00:00 STG01-0101-0400-01T2-lc04 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015

<14>2025-01-30T21:15:23.418543+00:00 STG01-0101-0400-01T2-lc04 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015

<14>2025-01-30T21:15:23.565039+00:00 STG01-0101-0400-01T2-lc04 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015

<10>2025-01-30T21:15:23.626048+00:00 STG01-0101-0400-01T2-lc03 CRIT syncd0#syncd: [06:00.0] SAI_API_SWITCH:_brcm_sai_switch_event_cb:950 0x5a1100 Received unhandled switch event - Device Interrupt(13) on unit 0: 0x8b8 0x0 0x0

<10>2025-01-30T21:15:24.126213+00:00 STG01-0101-0400-01T2-lc03 CRIT syncd1#syncd: [07:00.0] SAI_API_SWITCH:_brcm_sai_switch_event_cb:950 0x5a1100 Received unhandled switch event - Device Interrupt(13) on unit 0: 0x8b8 0x0 0x0
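To see at a glance which cards raised the interrupt, the syslog lines above can be tallied per card. This is only a minimal sketch: the regular expression assumes the `<pri>timestamp host SEVERITY message` layout shown in the excerpt, and `count_interrupts` and `SAMPLE` are hypothetical names. The sample lines are copied from the excerpt, with the long RTP message text cut short.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt above:
#   <pri>ISO-timestamp hostname SEVERITY rest-of-message
LINE_RE = re.compile(
    r"^<\d+>(?P<ts>\S+)\s+(?P<host>\S+)\s+(?P<sev>[A-Z]+)\s+(?P<msg>.*)$"
)


def count_interrupts(lines, needle="RTP_LinkMaskChange"):
    """Count log lines containing `needle`, grouped by card (hostname suffix)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and needle in m.group("msg"):
            # Hostnames in the excerpt end with the card name, e.g. "-lc05".
            card = m.group("host").rsplit("-", 1)[-1]
            counts[card] += 1
    return counts


# Lines copied from the excerpt; the RTP messages are abbreviated.
SAMPLE = [
    "<14>2025-01-30T21:15:22.349038+00:00 STG01-0101-0400-01T2-lc05 INFO "
    "syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666",
    "<10>2025-01-30T21:15:22.349099+00:00 STG01-0101-0400-01T2-lc05 CRIT "
    "syncd#syncd: [none] SAI_API_SWITCH:_brcm_sai_switch_event_cb:950 0x5a1100 "
    "Received unhandled switch event - Device Interrupt(13) on unit 0: 0x29a 0x0 0x0",
    "<14>2025-01-30T21:15:23.122929+00:00 STG01-0101-0400-01T2-lc04 INFO "
    "syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666",
]

if __name__ == "__main__":
    for card, n in sorted(count_interrupts(SAMPLE).items()):
        print(card, n)
```

Passing `needle="Device Interrupt"` instead would tally the CRIT "unhandled switch event" lines the same way.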

@arista-nwolfe

arista-nwolfe commented Feb 13, 2025

when bumping image from 202205->202405, the swss0/1 and syncd0/1 were down, this issue happened in staging only, we did not see the issue in lab when bumping image. Also this is not platform specific issue, could be related to image, but not sure why it happens in staging only.

@wenyiz2021 I thought this was understood and the problem was that the minigraph/config_db wasn't updated to give the fabric cards switchIds?
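A quick way to confirm this on a given device is to check each ASIC's config_db for a switch ID. The sketch below is only an illustration under assumptions: it takes the ID to live in the `switch_id` field of `DEVICE_METADATA|localhost`, and it works on already-parsed JSON dicts rather than querying Redis. The function name, the namespace labels, and the example values are all hypothetical.

```python
def missing_switch_ids(asic_configs):
    """Return the labels of configs that have no switch_id set.

    asic_configs maps a label (for example a namespace name) to a parsed
    config_db dict. The DEVICE_METADATA -> localhost -> switch_id path is
    an assumption about the schema, not something stated in this thread.
    """
    missing = []
    for name, cfg in asic_configs.items():
        meta = cfg.get("DEVICE_METADATA", {}).get("localhost", {})
        # Treat an absent, empty, or whitespace-only value as missing.
        if not str(meta.get("switch_id", "")).strip():
            missing.append(name)
    return sorted(missing)


if __name__ == "__main__":
    # Hypothetical example data, not taken from the affected device.
    example = {
        "asic0": {"DEVICE_METADATA": {"localhost": {"switch_type": "fabric", "switch_id": "2"}}},
        "asic1": {"DEVICE_METADATA": {"localhost": {"switch_type": "fabric"}}},
    }
    print(missing_switch_ids(example))
```

Running the same check on a lab device and a staging device would show whether the missing IDs are what separates the two environments.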

@kenneth-arista

Regarding the L1 CRC errors seen on downstream T1 devices, this was a complete coincidence with bringing up fabric module 0. Any uncorrectable issue with fabric links would lead to packet drops, and such packets would not make it to the front-panel ports to generate CRC errors. The CRC errors are caused by something else.

Regarding the syslogs posted in the description, they are benign as they are generated when fabric chips are initializing and coming online.
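One way to sanity-check that reading is to measure how long after the swss start each interrupt was logged. The sketch below uses timestamps from the description. The function names and the `window_s` parameter are hypothetical, and the 60-second value in the example is an arbitrary placeholder, since this thread does not say how long fabric initialization takes.

```python
from datetime import datetime


def seconds_after(start_ts, event_ts):
    """Seconds elapsed from start_ts to event_ts (both ISO 8601 strings)."""
    delta = datetime.fromisoformat(event_ts) - datetime.fromisoformat(start_ts)
    return delta.total_seconds()


def within_init_window(start_ts, event_timestamps, window_s):
    """For each event, report whether it falls within window_s seconds after start_ts."""
    return [0 <= seconds_after(start_ts, ts) <= window_s for ts in event_timestamps]


if __name__ == "__main__":
    # Timestamps taken from the syslogs in the description.
    swss_started = "2025-01-30T21:14:59.666022+00:00"
    interrupts = [
        "2025-01-30T21:15:22.349038+00:00",  # first RTP_LinkMaskChange on lc05
        "2025-01-30T21:15:24.126213+00:00",  # last Device Interrupt on lc03
    ]
    for ts in interrupts:
        print(ts, round(seconds_after(swss_started, ts), 3))
    # 60 s is a placeholder window, not a documented initialization time.
    print(within_init_window(swss_started, interrupts, window_s=60))
```

Interrupts that keep arriving well after the window would be worth a second look; a burst confined to the seconds after the restart is consistent with initialization.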

@kenneth-arista

There is a possibility the packet corruption (CRC errors) is due to FIFO underflow. If the issue occurs again, the thresholds can be tuned to confirm whether the underflow is the trigger. Broadcom has documented how to tune this threshold and we can help determine a better setting (https://brcmsemiconductor-csm.wolkenservicedesk.com/wolken-support/article?articleId=19817).

Closing this issue. Please continue to monitor and if it occurs again, let's get on a call.

@wenyiz2021
Author

when bumping image from 202205->202405, the swss0/1 and syncd0/1 were down, this issue happened in staging only, we did not see the issue in lab when bumping image. Also this is not platform specific issue, could be related to image, but not sure why it happens in staging only.

@wenyiz2021 I thought this was understood and the problem was that the minigraph/config_db wasn't updated to give the fabric cards switchIds?

Thanks. Yes, I think there is a PR to fix the switchId @arlakshm
