link down issue on Arista SONIC T2 in staging when switching from 202205 to 202405 #120

wenyiz2021 · 2025-02-10T23:25:53Z

So far, we have noticed that the link down alert fires when we upgrade the Arista T2 from the SONiC 202205 image to the 202405 image.

The alert arises because of two issues:

1. When bumping the image from 202205 to 202405, swss0/1 and syncd0/1 went down. This happened in staging only; we did not see it in the lab when bumping the image. It is also not a platform-specific issue. It could be related to the image, but we are not sure why it happens in staging only.
2. On the Arista T2 running the 202405 image, after we restart swss0/1, we see syncd interrupts on all LCs, and this is potentially causing the CRC link errors on its peer T1s (the interrupt syslogs below are the suspected culprit; their timestamps match the alert creation time):

<30>2025-01-30T21:14:59.666022+00:00 STG01-0101-0400-01T2-sup00 INFO systemd[1]: Started [email protected] - switch state service.

<13>2025-01-30T21:15:01.666281+00:00 STG01-0101-0400-01T2-sup00 NOTICE root: Started swss0 service...

<14>2025-01-30T21:15:22.349038+00:00 STG01-0101-0400-01T2-lc05 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015

<10>2025-01-30T21:15:22.349099+00:00 STG01-0101-0400-01T2-lc05 CRIT syncd#syncd: [none] SAI_API_SWITCH:_brcm_sai_switch_event_cb:950 0x5a1100 Received unhandled switch event - Device Interrupt(13) on unit 0: 0x29a 0x0 0x0

<14>2025-01-30T21:15:22.749727+00:00 STG01-0101-0400-01T2-lc05 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015

<14>2025-01-30T21:15:22.840429+00:00 STG01-0101-0400-01T2-lc05 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015

<14>2025-01-30T21:15:23.122929+00:00 STG01-0101-0400-01T2-lc04 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015

<10>2025-01-30T21:15:23.123224+00:00 STG01-0101-0400-01T2-lc04 CRIT syncd#syncd: [none] SAI_API_SWITCH:_brcm_sai_switch_event_cb:950 0x5a1100 Received unhandled switch event - Device Interrupt(13) on unit 0: 0x29a 0x0 0x0

<14>2025-01-30T21:15:23.173596+00:00 STG01-0101-0400-01T2-lc04 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015

<14>2025-01-30T21:15:23.305868+00:00 STG01-0101-0400-01T2-lc04 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015

<14>2025-01-30T21:15:23.418543+00:00 STG01-0101-0400-01T2-lc04 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015

<14>2025-01-30T21:15:23.565039+00:00 STG01-0101-0400-01T2-lc04 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015

<10>2025-01-30T21:15:23.626048+00:00 STG01-0101-0400-01T2-lc03 CRIT syncd0#syncd: [06:00.0] SAI_API_SWITCH:_brcm_sai_switch_event_cb:950 0x5a1100 Received unhandled switch event - Device Interrupt(13) on unit 0: 0x8b8 0x0 0x0

<10>2025-01-30T21:15:24.126213+00:00 STG01-0101-0400-01T2-lc03 CRIT syncd1#syncd: [07:00.0] SAI_API_SWITCH:_brcm_sai_switch_event_cb:950 0x5a1100 Received unhandled switch event - Device Interrupt(13) on unit 0: 0x8b8 0x0 0x0
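To compare interrupt timing against the alert creation time, it can help to tally these messages per line card. The following is a minimal sketch, assuming the syslog lines have exactly the `<pri>timestamp host SEVERITY message` shape shown above; the function name `count_interrupts` and the regular expressions are illustrative and not part of any SONiC tool.

```python
import re
from collections import Counter

# Matches the syslog shape quoted above: "<pri>timestamp host SEVERITY message".
LINE_RE = re.compile(
    r"^<\d+>(?P<ts>\S+)\s+(?P<host>\S+)\s+(?P<sev>[A-Z]+)\s+(?P<msg>.*)$"
)
# Matches the interrupt name and id printed by dnxc_interrupt_print_info.
INTERRUPT_RE = re.compile(
    r"dnxc_interrupt_print_info: name=(?P<name>\w+), id=(?P<id>\d+)"
)


def count_interrupts(lines):
    """Count dnxc interrupt log lines per (line card, interrupt name)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a syslog line in the expected shape
        irq = INTERRUPT_RE.search(m.group("msg"))
        if irq:
            # Assumption: the card is the hostname suffix, e.g. "...-lc05".
            card = m.group("host").rsplit("-", 1)[-1]
            counts[(card, irq.group("name"))] += 1
    return counts
```

Feeding the function the lines of a saved syslog file (for example, `count_interrupts(open(path))`) returns a `Counter` keyed by card and interrupt name, which makes it easy to see whether every LC reports `RTP_LinkMaskChange` in the same window after the swss restart.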

arista-nwolfe · 2025-02-13T23:13:46Z

> when bumping image from 202205->202405, the swss0/1 and syncd0/1 were down, this issue happened in staging only, we did not see the issue in lab when bumping image. Also this is not platform specific issue, could be related to image, but not sure why it happens in staging only.

@wenyiz2021 I thought this was understood and the problem was that the minigraph/config_db wasn't updated to give the fabric cards switchIds?
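A quick way to check for this condition is to scan each ASIC's configuration for a missing switch ID. The following is a rough sketch, assuming each ASIC's config_db has been dumped to a Python dict and that the ID lives under `DEVICE_METADATA` / `localhost` / `switch_id`; those key names, the ASIC names in the usage example, and the helper name `missing_switch_ids` are assumptions for illustration and should be verified against the actual schema on the device.

```python
def missing_switch_ids(config_by_asic):
    """Return the ASIC names whose metadata has no switch_id entry.

    config_by_asic maps an ASIC name to its config_db contents as a dict.
    The DEVICE_METADATA/localhost/switch_id path is an assumption about
    the schema, not something confirmed in this thread.
    """
    missing = []
    for asic, config in config_by_asic.items():
        meta = config.get("DEVICE_METADATA", {}).get("localhost", {})
        if "switch_id" not in meta:
            missing.append(asic)
    return sorted(missing)


# Hypothetical usage: asic1 has no switch_id and is reported.
example = {
    "asic0": {"DEVICE_METADATA": {"localhost": {"switch_id": "0"}}},
    "asic1": {"DEVICE_METADATA": {"localhost": {}}},
}
print(missing_switch_ids(example))
```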

kenneth-arista · 2025-02-14T07:53:42Z

Regarding the L1 CRC errors seen on downstream T1 devices, their timing was a complete coincidence with bringing up fabric module 0. Any uncorrectable issue with fabric links would lead to packet drops, and such packets would never make it to the front panel ports to generate CRC errors. The CRC errors are caused by something else.

Regarding the syslogs posted in the description, they are benign as they are generated when fabric chips are initializing and coming online.

kenneth-arista · 2025-02-14T08:39:58Z

There is a possibility the packet corruption (CRC errors) is due to FIFO underflow. If the issue occurs again, the thresholds can be tuned to confirm whether the underflow is the trigger. Broadcom has documented how to tune this threshold and we can help determine a better setting (https://brcmsemiconductor-csm.wolkenservicedesk.com/wolken-support/article?articleId=19817).

Closing this issue. Please continue to monitor and if it occurs again, let's get on a call.

wenyiz2021 · 2025-02-14T23:47:03Z

> > when bumping image from 202205->202405, the swss0/1 and syncd0/1 were down, this issue happened in staging only, we did not see the issue in lab when bumping image. Also this is not platform specific issue, could be related to image, but not sure why it happens in staging only.
>
> @wenyiz2021 I thought this was understood and the problem was that the minigraph/config_db wasn't updated to give the fabric cards switchIds?

Thanks, yes, I think there is a PR to fix the switchid @arlakshm

kenneth-arista closed this as completed Feb 14, 2025

kenneth-arista reopened this Feb 14, 2025

kenneth-arista closed this as completed Feb 14, 2025

kenneth-arista reopened this Feb 14, 2025
