Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port status not reflected in SONiC #4646

Closed
ciju-juniper opened this issue May 26, 2020 · 95 comments
Closed

Port status not reflected in SONiC #4646

ciju-juniper opened this issue May 26, 2020 · 95 comments
Assignees
Labels
brcmsai Bug 🐛 P0 Priority of the issue

Comments

@ciju-juniper
Copy link
Contributor

ciju-juniper commented May 26, 2020

Description

This is an issue observed in the latest SONiC images. When the cables are connected/removed in the switch ports, interface status is not correctly reflected in 'show interfaces status'. Link status is updated correctly in the broadcom side. Changes in the link state are not updated in the kernel. The files '/sys/class/net/Ethernet*/carrier' are not updated whenever the OIR is done. After a system reboot, the link status is updated in the kernel and SONiC is able to report the status correctly.

We tried to narrow down this problem to a specific kernel version, but the SONiC builds are broken when we go back to couple of weeks / months due to certain package versions missing.
The last commit on which we didn't observe this issue was on Feb-26 and there are quite significant changes happened in 'drivers/net' directory of the kernel in the span of 1.5 months.

Please have a look on this issue and suggest how to debug further.

Testing environment

Switch: QFX5200-32C-S
ASIC: TH1
Branch: master

Link is configured with 100G and the DAC cable is connected back to back with the ports in the same switch. No platform specific drivers are loaded apart from ASIC configuration files in 'device' directory.

Here are the logs from problematic image (May 20th Jenkins image)

Trial 1

100G DAC Cable connected to port 0 and port 1 after the box is rebooted.

root@sonic:/home/admin# show interfaces status
  Interface            Lanes    Speed    MTU            Alias    Vlan    Oper    Admin    Type    Asym PFC
-----------  ---------------  -------  -----  ---------------  ------  ------  -------  ------  ----------
  Ethernet0      49,50,51,52     100G   9100   hundredGigE1/1  routed      up       up     N/A         N/A
  Ethernet4      53,54,55,56     100G   9100   hundredGigE1/2  routed      up       up     N/A         N/A
  Ethernet8      57,58,59,60     100G   9100   hundredGigE1/3  routed    down       up     N/A         N/A
  ....

BCMCMD also shows link up in bcm asic for these 2 ports
root@sonic:/home/admin# bcmcmd ps | grep ce12
      ce12( 50)  up     4  100G  FD   SW  No   Forward          None    F    KR4  9122    No
root@sonic:/home/admin# bcmcmd ps | grep ce13
      ce13( 54)  up     4  100G  FD   SW  No   Forward          None    F    KR4  9122    No

Also "carrier" parameter is updated in "sys" directory for the ports connected with cables

root@sonic:/home/admin# cat /sys/class/net/Ethernet4/carrier
1
root@sonic:/home/admin# cat /sys/class/net/Ethernet0/carrier
1
Trial 2

100G DAC Cable connected to port 0 and connected to port 31.

root@sonic:/home/admin# show interfaces status
  Interface            Lanes    Speed    MTU            Alias    Vlan    Oper    Admin    Type    Asym PFC
-----------  ---------------  -------  -----  ---------------  ------  ------  -------  ------  ----------
  Ethernet0      49,50,51,52     100G   9100   hundredGigE1/1  routed      up       up     N/A         N/A
  Ethernet4      53,54,55,56     100G   9100   hundredGigE1/2  routed      up       up     N/A         N/A
  ...
  Ethernet120       9,10,11,12     100G   9100  hundredGigE1/31  routed    down       up     N/A         N/A
  Ethernet124      13,14,15,16     100G   9100  hundredGigE1/32  routed    down       up     N/A         N/A

BCMCMD also shows link up in bcm asic for these 2 ports

root@sonic:/home/admin# bcmcmd ps | grep ce12
      ce12( 50)  up     4  100G  FD   SW  No   Forward          None    F    KR4  9122    No
root@sonic:/home/admin# bcmcmd ps | grep ce3
       ce3( 13)  up     4  100G  FD   SW  No   Forward          None    F    KR4  9122    No

"carrier" parameter is not updated in "sys" directory for the ports connected with cables, still showing carrier up for port 1

root@sonic:/home/admin# cat /sys/class/net/Ethernet124/carrier
0
root@sonic:/home/admin# cat /sys/class/net/Ethernet4/carrier
1

Here is the dump from problematic image:

sonic_dump_sonic_20200526_071154.tar.gz

Last working commit from master branch: 1ef7403

Here are the logs from the working kernel image:

Trial 1

DAC cable is connected between physical port-0 & port-1 and system is rebooted.

root@sonic:/home/admin# show interfaces status
  Interface            Lanes    Speed    MTU            Alias    Vlan    Oper    Admin    Type    Asym PFC
-----------  ---------------  -------  -----  ---------------  ------  ------  -------  ------  ----------
  Ethernet0      49,50,51,52     100G   9100   hundredGigE1/1  routed      up       up     N/A         N/A
  Ethernet4      53,54,55,56     100G   9100   hundredGigE1/2  routed      up       up     N/A         N/A
 
root@sonic:/home/admin# bcmcmd ps | grep ce12
      ce12( 50)  up     4  100G  FD   SW  No   Forward          None    F    KR4  9122    No
root@sonic:/home/admin# bcmcmd ps | grep ce13
      ce13( 54)  up     4  100G  FD   SW  No   Forward          None    F    KR4  9122    No

root@sonic:/home/admin# cat /sys/class/net/Ethernet4/carrier
1
root@sonic:/home/admin# cat /sys/class/net/Ethernet0/carrier
1
Trial 2

100G DAC Cable connected to port 0 and connected to port 31.

root@sonic:/home/admin# show interfaces status
  Interface            Lanes    Speed    MTU            Alias    Vlan    Oper    Admin    Type    Asym PFC
-----------  ---------------  -------  -----  ---------------  ------  ------  -------  ------  ----------
  Ethernet0      49,50,51,52     100G   9100   hundredGigE1/1  routed      up       up     N/A         N/A
  Ethernet4      53,54,55,56     100G   9100   hundredGigE1/2  routed    down       up     N/A         N/A
  .....
  Ethernet120    9,10,11,12      100G   9100  hundredGigE1/31  routed    down       up     N/A         N/A
  Ethernet124    13,14,15,16     100G   9100  hundredGigE1/32  routed      up       up     N/A         N/A
  
root@sonic:/home/admin# bcmcmd ps | grep ce13
      ce13( 54)  down   4  100G  FD   SW  No   Forward          None    F    KR4  9122    No
	  
root@sonic:/home/admin# bcmcmd ps | grep ce12
      ce12( 50)  up     4  100G  FD   SW  No   Forward          None    F    KR4  9122    No
	  
root@sonic:/home/admin# bcmcmd ps | grep ce3
       ce3( 13)  up     4  100G  FD   SW  No   Forward          None    F    KR4  9122    No

root@sonic:/home/admin# cat /sys/class/net/Ethernet124/carrier
1
root@sonic:/home/admin# cat /sys/class/net/Ethernet0/carrier
1

Here is the dump from working image:

sonic_dump_sonic_20200522_143911.tar.gz

@ciju-juniper
Copy link
Contributor Author

@lguohan @jleveque Please have a look and let me know how to proceed further. We are trying on 201911 branch as well. Will update the test results shortly.

@ciju-juniper
Copy link
Contributor Author

In 201911 branch, issue is not seen.

@rlhui
Copy link
Contributor

rlhui commented May 27, 2020

@ciju-juniper, it's a little bit not clear what the issue is still. Please make it more clear what is not expected, and make the formatting more clear. Thanks.

@ciju-juniper
Copy link
Contributor Author

@rlhui When the cables are connected/removed in the switch ports, the link status is not updated in the kernel. The files '/sys/class/net/Ethernet*/carrier' are not updated whenever the OIR is done. Due to this problem, 'show interfaces status' is not reflecting the link status.

@lguohan
Copy link
Collaborator

lguohan commented May 27, 2020

@ciju-juniper , can you provide the dump file after you did trail 2 for the bad image. I need to look at the sairedis log to see if port status notification is generated or not. In your current dump, it seems it is for trail 1.

as you can see, it only has up notification. when you unplug Ethernet4, we should receive a port down notification from SAI, which I cannot find in the sairedis log.

lgh@gulv-vm3:~/sonic_dump_sonic_20200526_071154/log$ grep OPER_STATUS sairedis.rec 
2020-05-26.06:53:11.592558|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-05-26.06:53:11.592925|s|SAI_OBJECT_TYPE_HOSTIF:oid:0xd0000000005ee|SAI_HOSTIF_ATTR_OPER_STATUS=true
2020-05-26.06:53:11.595581|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-05-26.06:53:11.595802|s|SAI_OBJECT_TYPE_HOSTIF:oid:0xd0000000005ef|SAI_HOSTIF_ATTR_OPER_STATUS=true

@ciju-juniper
Copy link
Contributor Author

@lguohan The dump was captured after Trial-2. As you can see '/sys/class/net/Ethernet4/carrier' still reports '1' even after the cable pull out.

We can try recreating the problem and capture the dump again. Needs a day or two as we don't have physical access to the box these days.

Could you tell me which driver in the kernel find out the link status change and update the '/sys/class/net/Ethernet4/carrier'?

@lguohan
Copy link
Collaborator

lguohan commented May 27, 2020

the mismatch here is that if you look at the broadcom.ps file, it says ce12 and ce13 up. In the trail-2, i think it should be ce3 to be up according to your good trail log. So, it seems after you swap the cable, ce3 is not up from broadcom.ps file.

that is why I am not sure if you have provide the right capture after trail-2.

      ce12( 50)  up     4  100G  FD   SW  No   Forward          None    F    KR4  9122    No      
      xe36( 51)  !ena   1   25G  FD None  No   Disable          None   FA  XGMII  9412    No      
      xe37( 52)  !ena   1   25G  FD None  No   Disable          None   FA  XGMII  9412    No      
      xe38( 53)  !ena   1   25G  FD None  No   Disable          None   FA  XGMII  9412    No      
      ce13( 54)  up     4  100G  FD   SW  No   Forward          None    F    KR4  9122    No      

ethernet4 is the broadcom knet driver. when there is link down, sai will send a notification to orchagent, and orchagent will call another SAI api to set the carrier down. From your log, i do not see the sai send the notification.

@ciju-juniper
Copy link
Contributor Author

@lguohan Please see the attached test logs and the dump from the bad image.
Test_logs_28_05_2020.txt
sonic_dump_sonic_20200528_055104.tar.gz

@ciju-juniper
Copy link
Contributor Author

ciju-juniper commented May 28, 2020

@lguohan Look like there are no SAI notifications after Trial-2

crajank-mbp:log crajank$ grep -i "SAI_PORT_OPER_STATUS_" sairedis.rec 
2020-05-28.05:38:10.297727|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-05-28.05:38:10.300046|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-05-28.05:40:23.912578|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|
2020-05-28.05:40:23.914670|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|
2020-05-28.05:40:28.943032|n|port_state_change|[{"port_id":"oid:0x1000000000005","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-05-28.05:40:28.945893|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_UP"}]|

@lguohan
Copy link
Collaborator

lguohan commented May 28, 2020

so, it looks like a broadcom SAI issue. if you bring down the port on the other side, will the SAI generate the notification?

@ciju-juniper
Copy link
Contributor Author

@lguohan We had tested by connecting DAC cables in loopback within the same box.

Are you asking to do an 'ifconfig Ethernet0 down' in this loopback setup? Or connect the cables between two boxes and check?

@lguohan
Copy link
Collaborator

lguohan commented May 28, 2020

basically, I am asking to simulate the port oper status down and check if sai sends a notification to upper layer or not.

@BaluAlluru
Copy link

We tried to simulate the port oper status down by executing below command to make the port 1 down
bcmcmd "port ce13 Enable=False".

We are seeing the issue in latest SONIC image. Attached is jenkins290_logs file. This file has the logs captured for latest SONIC Jenkins 290 image on the box. In the attached log, the issue is seen after Trial 5. SAI is not sending notification to upper layer after few iterations.
Repeated multiple times the same exercise of disabling the port and enabling the port
and checking the SAI REDIS logs. Port State change event logs are not seen in SAI REDIS.

We also loaded "201911" release image on the box and repeated the same exercise. We don't see issue with this image. In this image, we see SAI sends notification to upper layer. we did many iterations of simulating the link status up/down and didn't observe the issue.
201911_logs attached file has logs for reference.

201911_logs.txt
jenkins290_logs.txt

@ciju-juniper
Copy link
Contributor Author

@lguohan What could be the next steps to debug this issue further?

@ciju-juniper
Copy link
Contributor Author

@lguohan How do we take this issue forward? Could we talk to BRCM to see what is going wrong with SAI?

@ciju-juniper
Copy link
Contributor Author

ciju-juniper commented Jun 17, 2020

@rlhui I see a mail from you regarding SAI 1.6.2 release. Would you know if this problem is fixed in that?

@ciju-juniper
Copy link
Contributor Author

@smaheshm I see your commit for SAI 1.6.1. Are you aware of this issue?

@smaheshm
Copy link
Contributor

@smaheshm I see your commit for SAI 1.6.1. Are you aware of this issue?

No, not aware of this issue. Let me check the status of 1.6.2 release. Will open a case with BRCM if required.

@ciju-juniper
Copy link
Contributor Author

@smaheshm We verified this issue with SAI 1.6.1. Problem is still seen. Please open a case with BRCM. This is very easy to re-create and very basic.

@smaheshm
Copy link
Contributor

@smaheshm We verified this issue with SAI 1.6.1. Problem is still seen. Please open a case with BRCM. This is very easy to re-create and very basic.

@ciju-juniper Stay tuned for updated BRCM-SAI debian package. We have to check if the issue persists in the updated debian package and then open a case.

@rlhui
Copy link
Contributor

rlhui commented Jun 23, 2020

@ciju-juniper, would you please confirm/clarify if this issue is seen ONLY with cables connecting ports within the same device? Have you tried the same step, but with cables connecting ports of two separate boxes? Thanks.

@ciju-juniper
Copy link
Contributor Author

@rlhui We had tried with a different switch at the other end. Issue is seen when the remote is down or the cable is pulled out from the other end. Port status is not changed in SONiC and no SAI events detected.

@gechiang
Copy link
Collaborator

@ciju-juniper I have tried to reproduce this same issue using master.320 image and flapped the link over 30 times but not able to hit the same issue you reported. May I ask for your help on the following:

  1. If possible, please use the latest master branch image (320 for example) and try to reproduce this issue. If still reproducible, please gather the following information for me:
    a). drop into BCM shell by executing the following cmd: bcmsh
    b). Issue the cmd "bsv" to collect the SAI version output.
    c). Issue the cmd "show unit" to collect the ASIC version output.
    d). Issue the cmd "ver" to collect the BRCM SDK version output.

  2. Please try using another cable other than the one that you were able to reproduce the issue and see if issue still persists.

  3. Go back to the 201911 image that you used and collect the SONiC image version by doing "show version".

  4. Drop down to BCM shell and collect the same info as I stated above under a) through d).

Please perform the above and collect the info that I requested so that I can see if I can reproduce it and work with BRCM SAI team to solve this issue.
Thanks!

@ciju-juniper
Copy link
Contributor Author

@gechiang Problem is seen on master.320 image. We had tried to recreate the issue with different boxes and different cables. Same behaviour.

Our test is very simple. Initially 100G DAC cable was connected to Ethernet8 & Ethernet12.

root@sonic:/home/admin# cat /var/log/swss/sairedis.rec | grep -i "SAI_PORT_OPER_STATUS_"
2019-02-14.10:12:43.713628|n|port_state_change|[{"port_id":"oid:0x1000000000016","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2019-02-14.10:12:43.716446|n|port_state_change|[{"port_id":"oid:0x1000000000018","port_state":"SAI_PORT_OPER_STATUS_UP"}]|

After that cable was moved to Ethernet16. Now it's connected between Ethernet8 & Ethernet16.

root@sonic:/home/admin# show interfaces status
  Interface            Lanes    Speed    MTU    FEC          Alias    Vlan    Oper    Admin             Type    Asym PFC
-----------  ---------------  -------  -----  -----  -------------  ------  ------  -------  ---------------  ----------
  Ethernet0      73,74,75,76     100G   9100    N/A   hundredGigE1  routed    down       up              N/A         N/A
  Ethernet4      65,66,67,68     100G   9100    N/A   hundredGigE2  routed    down       up              N/A         N/A
  Ethernet8      81,82,83,84     100G   9100    N/A   hundredGigE3  routed      up       up  QSFP28 or later         N/A
 Ethernet12      89,90,91,92     100G   9100    N/A   hundredGigE4  routed      up       up              N/A         N/A
 Ethernet16  105,106,107,108     100G   9100    N/A   hundredGigE5  routed    down       up  QSFP28 or later         N/A

There are no SAI event messages for port down and up events. 'show interface status' still shows ports Ethernet8 & Ethernet12 are up, even though there is no cable connections.

We can see that ports (Ethernet8 & Ethernet16) are up in BCM shell

ce20( 38)  up     4  100G  FD   SW  No   Forward          None    F  CAUI4  9122    No
ce26( 44)  up     4  100G  FD   SW  No   Forward          None    F  CAUI4  9122    No

@ciju-juniper
Copy link
Contributor Author

ciju-juniper commented Jun 25, 2020

@gechiang Here are the o/p that you have asked with master.320 image. This is from a TH2 based platform

admin@sonic:~$ show version

SONiC Software Version: SONiC.master.320-0d809d0d
Distribution: Debian 10.4
Kernel: 4.19.0-6-2-amd64
Build commit: 0d809d0d
Build date: Tue Jun 23 19:03:25 UTC 2020
Built by: johnar@jenkins-worker-4

Platform: x86_64-juniper_qfx5210-r0
HwSKU: Juniper-QFX5210-64C
ASIC: broadcom
Serial Number: YJ0219370007
Uptime: 11:06:03 up 54 min,  2 users,  load average: 1.93, 1.82, 1.73

1.b). Issue the cmd "bsv" to collect the SAI version output.

drivshell>bsv
bsv
BRCM SAI ver: [3.7.4.2], OCP SAI ver: [1.6.0], SDK ver: [6.5.16]

1.c). Issue the cmd "show unit" to collect the ASIC version output.

drivshell>show unit
show unit
Unit 0 chip BCM56970_B0 (current)

1.d). Issue the cmd "ver" to collect the BRCM SDK version output.

drivshell>ver
ver
Broadcom Command Monitor: Copyright (c) 1998-2020 Broadcom
Release: sdk-6.5.16 built 20200604 (Thu Jun  4 18:14:20 2020)
From sonicbld@eaa190965a84:/var/sonicbld/workspace/Build/broadcom/broadcom_sai/20-sai-build-brcm-3.7.4.2/output/x86-xgs5-deb80//sdk/bcmsdk
Platform: X86
OS: Unix (Posix)
Chips:
      
      
      
      
      
      
      
      
       BCM56640_A0,
       BCM56850_A0,
       BCM56340_A0,
       BCM56960_A0, BCM56860_A0,
      
      
      
      
       BCM56970_A0, BCM56870_A0,
       BCM56980_A0, BCM56980_B0,
      
PHYs:  BCM5400, BCM54182, BCM54185, BCM54180,
    BCM54140, BCM54192, BCM54195, BCM54190,
    BCM54194, BCM54210, BCM54220, BCM54280,
    BCM54282, BCM54240, BCM54285, BCM5428X,
    BCM54290, BCM54292, BCM54294, BCM54295,
    BCM54296, BCM8750, BCM8752, BCM8754,
    BCM84740, BCM84164, BCM84758, BCM84780,
    BCM84784, BCM84318, BCM84328, Sesto,
    copper sfp

2). Please try using another cable other than the one that you were able to reproduce the issue and see if issue still persists.
[Ciju] We had tried to recreate the issue with different boxes and different cables. Same behaviour. Problem is seen.

@ciju-juniper
Copy link
Contributor Author

@gechiang Here are the o/p that you have asked with 201911 based image. This is from a TH2 based platform

root@sonic:/home/admin# show version
 
SONiC Software Version: SONiC.201911.0-dirty-20200611.093730
Distribution: Debian 9.12
Kernel: 4.9.0-11-2-amd64
Build commit: d9848197
Build date: Thu Jun 11 02:17:54 UTC 2020
Built by: ciju@sonic-server

1.b). Issue the cmd "bsv" to collect the SAI version output.

root@sonic:/home/admin# bcmcmd bsv
bsv
BRCM SAI ver: [3.7.3.3], OCP SAI ver: [1.5], SDK ver: [6.5.16]
drivshell>

1.c). Issue the cmd "show unit" to collect the ASIC version output.

root@sonic:/home/admin# bcmcmd "show unit"
show unit
Unit 0 chip BCM56970_B0 (current)
drivshell>
root@sonic:/home/admin#

1.d). Issue the cmd "ver" to collect the BRCM SDK version output.

root@sonic:/home/admin# bcmcmd "ver"
ver
Broadcom Command Monitor: Copyright (c) 1998-2020 Broadcom
Release: sdk-6.5.16 built 20200417 (Fri Apr 17 02:10:18 2020)
From sonicbld@9cacac0fd10c:/var/sonicbld/workspace/Build/broadcom/broadcom_sai/20-sai-build-brcm-3.7/output/x86-xgs5-deb80//sdk/bcmsdk
Platform: X86
OS: Unix (Posix)
Chips:
 
 
 
 
 
 
 
 
       BCM56640_A0,
       BCM56850_A0,
       BCM56340_A0,
       BCM56960_A0, BCM56860_A0,
 
 
 
 
       BCM56970_A0, BCM56870_A0,
       BCM56980_A0, BCM56980_B0,
 
PHYs:  BCM5400, BCM54182, BCM54185, BCM54180,
    BCM54140, BCM54192, BCM54195, BCM54190,
    BCM54194, BCM54210, BCM54220, BCM54280,
    BCM54282, BCM54240, BCM54285, BCM5428X,
    BCM54290, BCM54292, BCM54294, BCM54295,
    BCM54296, BCM8750, BCM8752, BCM8754,
    BCM84740, BCM84164, BCM84758, BCM84780,
    BCM84784, BCM84318, BCM84328, Sesto,
    copper sfp

2). Please try using another cable other than the one that you were able to reproduce the issue and see if issue still persists.
[Ciju] We had tried to recreate the issue with different boxes and different cables. Same behaviour. Problem is seen.

@ciju-juniper
Copy link
Contributor Author

ciju-juniper commented Jun 25, 2020

@gechiang Here are the o/p that you have asked with master.320 image. This is from a TH1 based platform. Issue is seen on TH2 platform as well.

root@sonic:~# show version

SONiC Software Version: SONiC.master.320-0d809d0d
Distribution: Debian 10.4
Kernel: 4.19.0-6-2-amd64
Build commit: 0d809d0d
Build date: Tue Jun 23 19:03:25 UTC 2020
Built by: johnar@jenkins-worker-4

Platform: x86_64-juniper_qfx5200-r0
HwSKU: Juniper-QFX5200-32C-S
ASIC: broadcom
Serial Number: WD0218170442
Uptime: 07:49:29 up  1:17,  3 users,  load average: 1.78, 1.91, 1.81

1.b). Issue the cmd "bsv" to collect the SAI version output.

drivshell>bsv
bsv
BRCM SAI ver: [3.7.4.2], OCP SAI ver: [1.6.0], SDK ver: [6.5.16]

1.c). Issue the cmd "show unit" to collect the ASIC version output.

drivshell>show unit
show unit
Unit 0 chip BCM56960_B1 (current)

1.d). Issue the cmd "ver" to collect the BRCM SDK version output.

drivshell>ver
ver
Broadcom Command Monitor: Copyright (c) 1998-2020 Broadcom
Release: sdk-6.5.16 built 20200604 (Thu Jun  4 18:14:20 2020)
From sonicbld@eaa190965a84:/var/sonicbld/workspace/Build/broadcom/broadcom_sai/20-sai-build-brcm-3.7.4.2/output/x86-xgs5-deb80//sdk/bcmsdk
Platform: X86
OS: Unix (Posix)
Chips:
      
      
      
      
      
      
      
      
       BCM56640_A0,
       BCM56850_A0,
       BCM56340_A0,
       BCM56960_A0, BCM56860_A0,
      
      
      
      
       BCM56970_A0, BCM56870_A0,
       BCM56980_A0, BCM56980_B0,
      
PHYs:  BCM5400, BCM54182, BCM54185, BCM54180,
    BCM54140, BCM54192, BCM54195, BCM54190,
    BCM54194, BCM54210, BCM54220, BCM54280,
    BCM54282, BCM54240, BCM54285, BCM5428X,
    BCM54290, BCM54292, BCM54294, BCM54295,
    BCM54296, BCM8750, BCM8752, BCM8754,
    BCM84740, BCM84164, BCM84758, BCM84780,
    BCM84784, BCM84318, BCM84328, Sesto,
    copper sfp

@gechiang
Copy link
Collaborator

@ciju-juniper Thank you for providing the detail version information. I was able to reproduce the same issue with our lab DUT that uses the TH1 (BCM56960_B1 ) chip. I have gathered the necessary information and have contacted BRCM to investigate this issue. Will update this thread when I have more information.
Thanks!

@gechiang
Copy link
Collaborator

gechiang commented Jul 7, 2020

@BaluAlluru Here is my investigation response to your TH1 test result that I read from sonic_dump_sonic_20200707_183916.tar.gz:
According to the syslog there was a problem with xcvrd at the following timestamp:

Jul 7 18:24:44.235976 sonic INFO pmon#supervisord: xcvrd Process Process-1:
Jul 7 18:24:44.235976 sonic INFO pmon#supervisord: xcvrd Traceback (most recent call last):
Jul 7 18:24:44.236731 sonic INFO pmon#supervisord: xcvrd File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Jul 7 18:24:44.236731 sonic INFO pmon#supervisord: xcvrd self.run()
Jul 7 18:24:44.236731 sonic INFO pmon#supervisord: xcvrd File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
Jul 7 18:24:44.236731 sonic INFO pmon#supervisord: xcvrd self._target(*self._args, **self._kwargs)
Jul 7 18:24:44.237275 sonic INFO pmon#supervisord: xcvrd File "/usr/bin/xcvrd", line 861, in task_worker
Jul 7 18:24:44.237785 sonic INFO pmon#supervisord: xcvrd status, port_dict = _wrapper_get_transceiver_change_event(timeout)
Jul 7 18:24:44.237785 sonic INFO pmon#supervisord: xcvrd TypeError: 'NoneType' object is not iterable

One of the xcvrd thread defunct (crashed) right after you unplugged from port 1 (Ethernet4) to move that connection to port 2 (Ethernet8).

root 4114 0.2 0.1 152960 21704 pts/0 Sl 18:22 0:02 /usr/bin/python /usr/bin/xcvrd
root 4213 0.0 0.0 0 0 pts/0 Z 18:22 0:00 [xcvrd]

After this occurred, all subsequent link bounce activities started to show problem where no more port state notification from SAI to SONiC application. Not sure if this xcvrd/platform driver issue may have some how contributed to SAI not behaving properly.
Spoke to one of our team member that is familiar with xcvrd and according to him this seems to be a platform driver issue where the platform API "platform_chassis.get_change_event()" is not returning anything which caused that code to crash. You may want to look into why "platform_chassis.get_change_event()" is having issue. I think xcvrd script could be improved to catch this and not crash. But according to my team member the platform driver response was not expected.

Also, in the syslog I see that there were lots of "Got sfp removed events". It seems that for some reason the platform driver may be glitching and causing many SFP removed events where there is maybe just one unplug/plug from port 1 to connect to port 2?
...
Jul 7 18:24:14.182210 sonic INFO pmon#xcvrd: Got SFP removed event
Jul 7 18:24:14.182571 sonic INFO pmon#xcvrd: receive plug out and pdate port sfp status table.
Jul 7 18:24:14.182909 sonic INFO pmon#xcvrd: Got SFP removed event
Jul 7 18:24:14.183152 sonic INFO pmon#xcvrd: receive plug out and pdate port sfp status table.
Jul 7 18:24:14.183571 sonic INFO pmon#xcvrd: Got SFP removed event
...

Also there is another event from the syslog that shows some issue:
Jul 7 18:24:44.240076 sonic INFO pmon#supervisord: xcvrd Error, file not exist /sys/class/i2c-adapter/i2c-16/16-0050/eeprom
Not sure why this file was expected to be present but seems to be missing from your platform... Something perhaps you can help take a look from platform driver point of view?

@gechiang
Copy link
Collaborator

gechiang commented Jul 7, 2020

@BaluAlluru Thanks for experimenting with the shut/startup experiment on Ethernet4.
I examined the syslog you captured and it seems that the "startup" cmd was received by Orchagent and applied to ASIC DB correctly. Somehow SAI is not handling that admin up request from SONiC application....

According to SAI dump:
SAI_OBJECT_TYPE_PORT oid:0x100000000000f ^M
SAI_PORT_ATTR_ADMIN_STATE : false^M
SAI_PORT_ATTR_MTU : 9122^M
SAI_PORT_ATTR_SPEED : 100000^M

But according to sairedis rec the cmd was given to SAI:
2020-07-07.23:53:22.847002|s|SAI_OBJECT_TYPE_PORT:oid:0x100000000000f|SAI_PORT_ATTR_ADMIN_STATE=true

But for some reason SAI did not handle this ADMIN UP request properly and resulted to what you observed...

In the other trial where you indicated it worked fine Ialso took a look and it seems that the admin shut/up were all handled by SAI correctly and thus no issue:

cat sairedis.rec | grep -i oid:0x100000000000f
2020-07-08.01:07:37.871294|g|SAI_OBJECT_TYPE_PORT:oid:0x100000000000f|SAI_PORT_ATTR_HW_LANE_LIST=8:0,0,0,0,0,0,0,0
2020-07-08.01:07:38.179600|g|SAI_OBJECT_TYPE_PORT:oid:0x100000000000f|SAI_PORT_ATTR_NUMBER_OF_INGRESS_PRIORITY_GROUPS=0
2020-07-08.01:07:38.181854|g|SAI_OBJECT_TYPE_PORT:oid:0x100000000000f|SAI_PORT_ATTR_INGRESS_PRIORITY_GROUP_LIST=8:oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0
2020-07-08.01:07:38.184358|g|SAI_OBJECT_TYPE_PORT:oid:0x100000000000f|SAI_PORT_ATTR_QOS_NUMBER_OF_QUEUES=8
2020-07-08.01:07:38.185363|g|SAI_OBJECT_TYPE_PORT:oid:0x100000000000f|SAI_PORT_ATTR_QOS_QUEUE_LIST=20:oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0
2020-07-08.01:07:38.188293|c|SAI_OBJECT_TYPE_HOSTIF:oid:0xd0000000005ef|SAI_HOSTIF_ATTR_TYPE=SAI_HOSTIF_TYPE_NETDEV|SAI_HOSTIF_ATTR_OBJ_ID=oid:0x100000000000f|SAI_HOSTIF_ATTR_NAME=Ethernet4
2020-07-08.01:07:38.188868|g|SAI_OBJECT_TYPE_PORT:oid:0x100000000000f|SAI_PORT_ATTR_ADMIN_STATE=true
2020-07-08.01:07:38.191325|g|SAI_OBJECT_TYPE_PORT:oid:0x100000000000f|SAI_PORT_ATTR_SPEED=0
2020-07-08.01:07:38.389049|s|SAI_OBJECT_TYPE_PORT:oid:0x100000000000f|SAI_PORT_ATTR_MTU=9122
2020-07-08.01:07:38.389158|s|SAI_OBJECT_TYPE_PORT:oid:0x100000000000f|SAI_PORT_ATTR_ADMIN_STATE=true
2020-07-08.01:07:38.877157|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-07-08.01:07:41.469457|c|SAI_OBJECT_TYPE_ROUTER_INTERFACE:oid:0x600000000064a|SAI_ROUTER_INTERFACE_ATTR_VIRTUAL_ROUTER_ID=oid:0x3000000000024|SAI_ROUTER_INTERFACE_ATTR_SRC_MAC_ADDRESS=0C:59:9C:81:E6:9E|SAI_ROUTER_INTERFACE_ATTR_TYPE=SAI_ROUTER_INTERFACE_TYPE_PORT|SAI_ROUTER_INTERFACE_ATTR_PORT_ID=oid:0x100000000000f|SAI_ROUTER_INTERFACE_ATTR_MTU=9100|SAI_ROUTER_INTERFACE_ATTR_NAT_ZONE_ID=0
2020-07-08.01:08:51.878593|s|SAI_OBJECT_TYPE_PORT:oid:0x100000000000f|SAI_PORT_ATTR_ADMIN_STATE=false
2020-07-08.01:08:51.925101|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|
2020-07-08.01:09:27.602920|s|SAI_OBJECT_TYPE_PORT:oid:0x100000000000f|SAI_PORT_ATTR_ADMIN_STATE=true
2020-07-08.01:09:28.109865|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-07-08.01:10:12.417737|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|
2020-07-08.01:10:33.033970|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_UP"}]|

cat sairedis.rec | grep -i "oid:0x100000000000e"
2020-07-08.01:07:37.870678|g|SAI_OBJECT_TYPE_PORT:oid:0x100000000000e|SAI_PORT_ATTR_HW_LANE_LIST=8:0,0,0,0,0,0,0,0
2020-07-08.01:07:38.169030|g|SAI_OBJECT_TYPE_PORT:oid:0x100000000000e|SAI_PORT_ATTR_NUMBER_OF_INGRESS_PRIORITY_GROUPS=0
2020-07-08.01:07:38.169958|g|SAI_OBJECT_TYPE_PORT:oid:0x100000000000e|SAI_PORT_ATTR_INGRESS_PRIORITY_GROUP_LIST=8:oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0
2020-07-08.01:07:38.171117|g|SAI_OBJECT_TYPE_PORT:oid:0x100000000000e|SAI_PORT_ATTR_QOS_NUMBER_OF_QUEUES=8
2020-07-08.01:07:38.171625|g|SAI_OBJECT_TYPE_PORT:oid:0x100000000000e|SAI_PORT_ATTR_QOS_QUEUE_LIST=20:oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0
2020-07-08.01:07:38.174616|c|SAI_OBJECT_TYPE_HOSTIF:oid:0xd0000000005ee|SAI_HOSTIF_ATTR_TYPE=SAI_HOSTIF_TYPE_NETDEV|SAI_HOSTIF_ATTR_OBJ_ID=oid:0x100000000000e|SAI_HOSTIF_ATTR_NAME=Ethernet0
2020-07-08.01:07:38.175008|g|SAI_OBJECT_TYPE_PORT:oid:0x100000000000e|SAI_PORT_ATTR_ADMIN_STATE=true
2020-07-08.01:07:38.177831|g|SAI_OBJECT_TYPE_PORT:oid:0x100000000000e|SAI_PORT_ATTR_SPEED=0
2020-07-08.01:07:38.384993|s|SAI_OBJECT_TYPE_PORT:oid:0x100000000000e|SAI_PORT_ATTR_MTU=9122
2020-07-08.01:07:38.385087|s|SAI_OBJECT_TYPE_PORT:oid:0x100000000000e|SAI_PORT_ATTR_ADMIN_STATE=true
2020-07-08.01:07:38.874941|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-07-08.01:07:41.326659|c|SAI_OBJECT_TYPE_ROUTER_INTERFACE:oid:0x600000000063b|SAI_ROUTER_INTERFACE_ATTR_VIRTUAL_ROUTER_ID=oid:0x3000000000024|SAI_ROUTER_INTERFACE_ATTR_SRC_MAC_ADDRESS=0C:59:9C:81:E6:9E|SAI_ROUTER_INTERFACE_ATTR_TYPE=SAI_ROUTER_INTERFACE_TYPE_PORT|SAI_ROUTER_INTERFACE_ATTR_PORT_ID=oid:0x100000000000e|SAI_ROUTER_INTERFACE_ATTR_MTU=9100|SAI_ROUTER_INTERFACE_ATTR_NAT_ZONE_ID=0
2020-07-08.01:08:51.923559|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|
2020-07-08.01:09:28.106632|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-07-08.01:10:12.321573|s|SAI_OBJECT_TYPE_PORT:oid:0x100000000000e|SAI_PORT_ATTR_ADMIN_STATE=false
2020-07-08.01:10:12.415092|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|
2020-07-08.01:10:32.496645|s|SAI_OBJECT_TYPE_PORT:oid:0x100000000000e|SAI_PORT_ATTR_ADMIN_STATE=true
2020-07-08.01:10:33.031686|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_UP"}]|

I also agree with you that it is inconclusive due to the one failure you observed. But it did appears that if you don't manually move the cable it is definitely behaving better that with cable unplug/plug handling...

In any case, I have already opened a CSP against BRCM and provided the dumps for them to investigate. Let's see what they find... In case they want to try out something and since I am not able to reproduce it here locally, I may have to come to you to get that for them...
Thanks!

@ciju-juniper
Copy link
Contributor Author

@gechiang Thanks!

We will investigate the xcvrd issue. Look like there were some recent changes went into xcvrd. After that we will again experiment with TH1 platform.

Please let us know if broadcom requests for any additional data from TH2 platform.

@lguohan @gechiang There is an orchagent crash in the latest images #4907 There is a bcm_knet driver crash and BCM asic initialization has failed. Look like a problem with bcm/SAI.

@gechiang
Copy link
Collaborator

@ciju-juniper @BaluAlluru BRCM has requested another repro with "link+ " enabled to help capture more debug information so that they can debug this issue. Can you please use master.345 image (the latest of master branch) and once your setup boots up, go to BCM shell and issue the following cmd:
bcmsh

drivshell>debug link +

Ctrl-c

After this, then try your steps to move cable to reproduce the issue.
Once issue reproduced, capture the "show tech" and send attach it here so that I can send it to BRCM for analysis.

BTW, I am really suspecting that the platform drive that caused the xcvrd issue may have somehow degraded the SAI behavior in someway... so appreciate if you can work from that side to see if there are new progress can be made. Since you do not have this issue running 20191130 based image. Is your platform drive different in 20191130 compared with master branch based version? If they are different, is it possible that you can use the platform driver that is from 20191130 while running the master based image if this is not too difficult for you? This will help eliminate the platform driver as a potential suspect of the root cause...
Thanks!

@BaluAlluru
Copy link

BaluAlluru commented Jul 12, 2020

@gechiang,
We tested on Jenkins 338 image on TH1 box.
xcvrd issue is resolved in this image. However port status issue still seen in this image.
Attached are the logs and "show techsupport" for reference.
QFX5200_Jenkins338_logs.txt
sonic_dump_sonic_20200710_174513.tar.gz

Trial 1 from "QFX5200_Jenkins338_logs" is captured right after box is loaded with this image and 100G DAC cables are connected between port 0 and port 1.
We see many SAI redis messages for port 0 and port 1 where 100G DAC cable is connected.
In earlier images we didn't see these many SAI redis messages.

root@sonic:/home/admin# cat /var/log/swss/sairedis.rec | grep -i "SAI_PORT_OPER_STATUS_"
2020-07-10.17:28:57.658748|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-07-10.17:28:57.661105|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-07-10.17:28:57.702412|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|
2020-07-10.17:28:57.704530|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|
2020-07-10.17:28:57.712760|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-07-10.17:28:57.714952|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-07-10.17:28:57.718733|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|
2020-07-10.17:28:57.720740|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|
2020-07-10.17:28:57.723837|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-07-10.17:28:57.726007|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-07-10.17:28:57.730443|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|
2020-07-10.17:28:57.732424|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|
2020-07-10.17:28:57.735494|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-07-10.17:28:57.737648|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-07-10.17:28:57.741634|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|
2020-07-10.17:28:57.743574|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|
2020-07-10.17:28:57.746427|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-07-10.17:28:57.748640|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
root@sonic:/home/admin#

As suggested by you, we will test on master.345 with broadcom ASIC debugs enabled and share the logs.

@BaluAlluru
Copy link

BaluAlluru commented Jul 13, 2020

@gechiang, We tested Jenkins 345 image on TH1 platform.
Port status issue observed with this image.

OIR sequences performed on TH1 platform.
Switch is rebooted by having Ethernet0 - Ethernet4 ports connected.
Cable moved to Ethernet0 - Ethernet8 ==> Issue observed
Cable moved to Ethernet0 - Ethernet40 ==> Issue observed
Cable moved to Ethernet0 - Ethernet124 ==> Issue observed
Cable moved to Ethernet0 - Ethernet4 ==> Issue observed

There is no xcvrd crash with this image.
We enabled "bcmcmd "debug link +" " and collected the dump.
Attached are the logs and dump.
TH1-Jenkins345-logs.txt
sonic_dump_sonic_20200713_144352.tar.gz

@gechiang
Copy link
Collaborator

gechiang commented Jul 13, 2020

@BaluAlluru Thanks for trying out the new image and set the new BCM debug cmd to capture the show tech dump file.
I noticed that in syslog1 although xcvrd did not crash it is still not behaving correctly. As soon as it started, it continuously detecting SFP insert and removal. the entire syslog is filled with xcvrd activities.
I do see that the "debug link +" is causing syncd to report all the link events from the SDK lower layer. But yet no SAI state change being reported.
I also see that once in this problem state orchagent seems to be stuck on
Jul 13 14:35:29.995292 sonic ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on getresponse
This is also the common symptom on all previous collected show tech support that you have posted to this issue.
I did some searching and it seems that there was a previous issue that also showed this symptom where syncd ended up unable to move due to some mutex issue:
#4347
Perhaps similar issue resurfaced again??
I will submit your new collected information to BRCM and ask them to take a look.
Thanks!

@BaluAlluru
Copy link

@gechiang, Today we tested Jenkins 345 image on TH2 platform.
Port status issue observed with this image.

OIR sequences performed on TH2 platform.
Switch is rebooted by having Ethernet8 - Ethernet12 ports connected.
Cable moved to Ethernet8 - Ethernet16 ==> Issue observed
Cable moved to Ethernet8 - Ethernet40 ==> Issue observed
Cable moved to Ethernet8 - Ethernet252 ==> Issue observed
Cable moved to Ethernet0 - Ethernet4 ==> Issue observed

There is no xcvrd crash with this image.
We enabled "bcmcmd "debug link +" " and collected the dump.
Attached are the logs and dump.
TH2-Jenkins345.txt
sonic_dump_sonic_20190214_102834.tar.gz

@gechiang
Copy link
Collaborator

@BaluAlluru BRCM is requesting for a repro with SAI log level setting change. Unfortunately the swssloglevel setting for SAI components are currently broken in master branch. See the following issue I raised for more details:

https://github.com/Azure/sonic-swss/issues/1348

We will wait for 1348 issue to be resolved and merged into master branch and then let's capture the info needed by BRCM to debug this issue further.
Thanks!

@gechiang
Copy link
Collaborator

@BaluAlluru @ciju-juniper
We finally have a successful build image in master branch 368 which included the SAI Log level setting fix.
I ran a quick try on my BRCM based DUT to set all the SAI components to DEBUG log level and tried out link flaps to see what info were being captured. To my disappointment I did not see any additional info provided by the log level setting. I am in touch with the BRCM team trying to clarify if what I observed is correct... Chances are we may have to hook you up with BRCM directly via a web-x session to have them debug your box directly...
Let me know If you are open with web-x live debug with your setup. Once I confirmed with BRCM, we will need to agree on a set of potential web-x live debug date/time windows that would work for all involved parties. If you already know your schedule for the first 2 weeks of August, please let me know which date/time slots you are open so I can bridge this together with BRCM support.
Thanks!

@gechiang
Copy link
Collaborator

@BaluAlluru @ciju-juniper BRCM team is available to perform live debug on your failing DUT for the followingtwo time slots next week:
Mon (8/3) 2-4 PM or Wed (8/5) 1-3PM.
Let me know if one of these time slots works for you.
If None works, please suggest a few time slots for the week of 8/10 that works for you and I check which one they can attend.
Thanks!

@ciju-juniper
Copy link
Contributor Author

@gechiang We work out of Bangalore, India. Due to ongoing pandemic issues, we are working remotely. So we need to talk to our LAB admins and get a time-slot during our day time. We will get back to you on this.

Where is the BRCM team located?

@ciju-juniper
Copy link
Contributor Author

@gechiang This is my mail id for scheduling meeting crajank[at]juniper[dot]net

@bingwang-ms bingwang-ms unpinned this issue Aug 20, 2020
@gechiang
Copy link
Collaborator

gechiang commented Sep 4, 2020

@ciju-juniper I understand that your lab was impacted by COVID19 few weeks ago and thus not able to provide the information that we have requested via offline email discussion. The request from BRCM/me was to load a master branch based gdb image and set up two GDB breakpoints that we provided via the email so that you can help to see during your trigger which one of those break points get hit. This information will help us narrow down further to which specific area the issue may reside.
As soon as we get the information, we can re-engage with BRCM team to investigate as needed.
Thanks!

@BaluAlluru
Copy link

@gechiang
Today we tested Jenkins 404 image on TH1 and TH2 platform.
We didn't observe port status issue.

In the first instance, loaded Jenkins 404 image on TH1 platform, connected 100G DAC cable back to back in different ports on TH1 platform and tested for OIR.
Didn't observe the port status issue.
Below are the logs captured.
Jenkins404-TH1-OIR-logs.txt

In the 2nd Instance,loaded Jenkins 404 image on TH1 platform and TH2 platform, connected 100G DAC cable between TH1 platform ports and TH2 platform ports.
Didn't observe the port status issue.
Below are the logs captured.
Jenkins404-TH1-TH2-OIR-logs.txt

@gechiang
Copy link
Collaborator

gechiang commented Sep 15, 2020

@BaluAlluru Thanks so much for getting back to working on this again. I have analyzed both logs that you captured. On surface although the end result of the link status seems correct, but there are some transitions that is not being detected correctly based on what you captured. The one on TH1-OIR only looks fine. But I do have some questions about the TH1-TH2-OIR test result.

For the TH1-TH2-OIR test result I noticed that when you moved the cable on one platform's port to a different port (let's take the first transition port 0 to port 1 on TH2 case), the output shows TH1 did not experience a link flap while TH2 did experience a link flap. I don't think this is the correct behavior. Based on your captured output it seems to me that the link bounce detection is based on physical SFP unplug/plug instead of true link bounce (laser off/on) detection. But this may have to do with how you moved this cable. Can you describe in more details on how the cable move is performed? Is it manually (physically unplug from one port and then inserted to a different port) or it is though some automated patch panel where the signal is rerouted from one port to another port? If it is manually, then we need to chase down why on TH1 when the cable is not removed but the other end (TH2) moved it is not able to detect the link bounce that is happening. If it is the automated patch panel case, then it may be related how this patch panel signal routing may have contributed to this behavior and it would be out of our scope to debug this further. Can you please clarify on this?
Thanks!

@ciju-juniper
Copy link
Contributor Author

@gechiang OIR was done manually. We will try again that scenario and try to get the gdb attached.

@gechiang
Copy link
Collaborator

@ciju-juniper Thanks for confirming that it is "manually" unplug/plug of the cable being performed.
@BaluAlluru When you retry this scenario again, please see if you can leave the cable disconnected a little longer and capture the swss/sairedis.rec to ensure you do observe the expected "DOWN" event from both TH1 and TH2 DUTs before you plug the cable to the new destination port. If you never see the link DOWN event from any of these DUTs, it is a problem that needs to be debugged and no need to go any further. One thing that I am suspecting is perhaps the "link de-bounce" logic in BRCM SAI may be the issue here... so leaving the cable unplugged should eventually result to detection of the "DOWN" event. If this "DOWN" event never arrives, please reset both set up and connect the gdb breakpoints and repeat the "cable unplug" test again and let us know which breakpoint is missing... Please mind that once the breakpoint is hit, hit the "c" to continue to prevent the process from crashing due to stuck at breakpoint causing queue full... etc...
Thanks!

@BaluAlluru
Copy link

@gechiang
As suggested by you in the previous update, we did the following exercise.

  1. Initially 100G DAC cable is connected between Port 0 of TH1 and Port 0 of TH2 box.
  2. We moved the cable to different ports in TH2 box,keeping cable connected to port 0 of TH1 box.
  3. We left the cable disconnected for little longer before we moved the cable from one port to
    other in TH2 box and collected the logs in this state also.

Attached are the updated logs collected today.
Updated-Jenkins404-TH1-TH2-OIR-logs.txt

Please check Trial 4 in the attached logs.
Below is the snippet of Trial 4, we think highlighted logs are not expected.

root@sonic:/home/admin# cat /var/log/swss/sairedis.rec | grep -i "SAI_PORT_OPER_STATUS_"
2020-09-16.14:10:14.002885|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-09-16.14:11:58.903073|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|
2020-09-16.14:14:17.392946|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-09-16.14:15:42.905382|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|
2020-09-16.14:20:30.111225|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-09-16.14:20:30.113532|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-09-16.14:20:35.152147|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|
2020-09-16.14:20:35.154622|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|

@gechiang
Copy link
Collaborator

@BaluAlluru Thanks a lot for trying out the suggested triage steps. Indeed this looks like a problem in BRCM SAI of handling link de-bounce. there seems to be 2 issues that we need some answer from BRCM SAI team based on your new test result:

  1. It seems that if the link is unplugged and plugged back quick enough, there is a chance that the link bounce event may go undetected.
  2. In your new test at trial 4 it seems that it actually reported the wrong port flapped when the other end is unplugged. But good thing is that the final steady state of the link(s) reported was correct. Sometime when moving the cable from one end it may caused some movement that could be detected as multiple link bounce events but it should never be reporting on a wrong port.

Both of these issues appears to be on the TH1 device ONLY. TH2 behaved correctly on all 3 logs you captured.
I will open a new case with BRCM to see if this is a known issue.

One more favor to ask. Can you capture the SAI version and the ASIC version of your TH1 device in BRCM shell?
a). drop into BCM shell by executing the following cmd: bcmsh
b). Issue the cmd "bsv" to collect the SAI version output.
c). Issue the cmd "show unit" to collect the ASIC version output.
d). Issue the cmd "ver" to collect the BRCM SDK version output.

With these additional info I will be able to open a new case for this.

BTW, I don't think this current issue should block your testing progress for now... You will have to perform the link test with cable out for a longer time when dealing with TH1 device for now and the end result should be correct.

@BaluAlluru
Copy link

@gechiang
Below are the command outputs requested by you on TH1 box.

root@sonic:/home/admin# bcmcmd bsv
bsv
BRCM SAI ver: [3.7.5.1], OCP SAI ver: [1.6.3], SDK ver: [6.5.16]
drivshell>

root@sonic:/home/admin# bcmcmd "show unit"
show unit
Unit 0 chip BCM56960_B1 (current)
drivshell>

root@sonic:/home/admin# bcmcmd "ver"
ver
Broadcom Command Monitor: Copyright (c) 1998-2020 Broadcom
Release: sdk-6.5.16 built 20200807 (Fri Aug 7 00:37:22 2020)
From sonicbld@b39d6afedeeb:/var/sonicbld/workspace/Build/broadcom/broadcom_sai/20-sai-build-brcm-3.7.5.1/output/x86-xgs5-deb80//sdk/bcmsdk
Platform: X86
OS: Unix (Posix)
Chips:

   BCM56640_A0,
   BCM56850_A0,
   BCM56340_A0,
   BCM56960_A0, BCM56860_A0,

   BCM56970_A0, BCM56870_A0,
   BCM56980_A0, BCM56980_B0,

PHYs: BCM5400, BCM54182, BCM54185, BCM54180,
BCM54140, BCM54192, BCM54195, BCM54190,
BCM54194, BCM54210, BCM54220, BCM54280,
BCM54282, BCM54240, BCM54285, BCM5428X,
BCM54290, BCM54292, BCM54294, BCM54295,
BCM54296, BCM8750, BCM8752, BCM8754,
BCM84740, BCM84164, BCM84758, BCM84780,
BCM84784, BCM84318, BCM84328, Sesto,
copper sfp

drivshell>

@gechiang
Copy link
Collaborator

@BaluAlluru I have filed BRCM CSP case CS00011146863
I don't think you have access to this CSP. I will update you when I get new update from BRCM on this...

@gechiang
Copy link
Collaborator

@BaluAlluru BRCM has requested for the following information to help debug this issue. Please repeat the exact experiment you did with cable unplugged for longer time between TH1 and TH2 DUTs but include the following:
go to BCM shell and issue the following cmd:
bcmsh
drivshell>debug link +
Ctrl-c
The above cmd will capture additional link information into the syslog
Run the unplug experiment that you did and look for the problem that we found at trial 4 previously. Once that is found you can stop the experiment and collect the syslog (show logging) + your trial logs as you did before and attach here so I can forward them to BRCM to investigate further.

BRCM also requested the following:
Is the TH1 box (QFX5200-32C-S) using an external phy or internal phy? If external, what is it?
Thanks!

@BaluAlluru
Copy link

@gechiang,
Thanks for your support.

We enabled Broadcom ASIC logs and repeated the same exercise. We couldn't recreate the issue after repeated tries.
Attached are the logs for reference.
Sept21-Jenkins404-TH1-TH2-OIRlogs.txt

As part of our interoperability testing, we did the same OIR testing exercise between 2 TH1 boxes.
one of the TH1 box has SONIC Jenkins 404 image running on it.
Another TH1 box has other Network operating system running not SONIC.
In this scenario also, we couldn't hit the problem.
Attached are the logs for reference
Sept21-Jenkins404-SONICTH1-OtherNOSTH1-OIRlogs.txt

We will update in this issue tracker, if we are able to recreate the issue.

@gechiang
Copy link
Collaborator

@BaluAlluru The two log files you sent looks pretty good. No issues at all as you also confirmed.
You mention that the problem is no longer reproducible. Wondering if you have also tried NOT to enable the "debug link +" and to confirm that the problem is NOT reproducible as well? If so, then this might be something physical with your cable or the connection not properly seated... If it is a SW issue, it should be repeatable.
Also, can you help answer the other question that BRCM asked in regard to:

"Is the TH1 box (QFX5200-32C-S) using an external phy or internal phy? If external, what is it?"

Thanks!

@ciju-juniper
Copy link
Contributor Author

@gechiang TH1 box (QFX5200-32C-S) is having a PHYLESS design. Tomahawk-1 BCM56960 chip that has 32 Falcon core (FC). All the FCs are directly connected to 32 zQSFP+ ports.

@BaluAlluru
Copy link

BaluAlluru commented Sep 22, 2020

@gechiang
We didn't enable "debug link +" and carried out the same exercise of OIR testing between TH1 and TH2 boxes.

Tried multiple times, but couldn't hit the issue.

Attached are the logs for reference.
Sept22-Jenkins404-TH1-TH2-OIRLogs.txt

@gechiang
Copy link
Collaborator

@ciju-juniper @BaluAlluru Thanks for supplying additional trials and clarifications.
It looks like the problem is no longer reproducible which makes me to suspect that the last issues was most likely a temporary issue where the cable itself may not have seated correctly.
I will update BRCM accordingly and I believe we can close this case for now.
If in future you start to see other issues, please open a new case.
Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
brcmsai Bug 🐛 P0 Priority of the issue
Projects
None yet
Development

No branches or pull requests

6 participants