The switch crashes when the "reboot" command is entered on the host #389

Closed
kaiyu22 opened this issue Mar 9, 2017 · 13 comments
@kaiyu22
Contributor

kaiyu22 commented Mar 9, 2017

Hi all,

I found a traceback issue in master (commit 1491bf9). When a user enters the "reboot" command on the host, the switch crashes. After investigating, it may be related to commit a877603 (Merge swss and syncd into single service).

The following is the error message.

[  207.772520] BUG: unable to handle kernel paging request at ffffffffa03ff0f0
[  207.780338] IP: [<ffffffff811a746f>] filp_close+0x1f/0x70
[  207.786396] PGD 1816067 PUD 1817063 PMD 27346a067 PTE 0
[  207.792260] Oops: 0000 [#1] SMP 
[  207.795872] Modules linked in: eeprom_mb(O) eeprom w83795 jc42 coretemp bridge stp llc i2c_mux_pca954x i2c_mux i2c_dev i2c_ismt i2c_i801 kvm_intel kvm crc32_pclmul xt_conntrack iTCO_wdt iTCO_vendor_support iptable_filter ipt_MASQUERADE aesni_intel aes_x86_64 xt_addrtype lrw gf128mul glue_helper ablk_helper lpc_ich mfd_core evdev cryptd serio_raw pcspkr iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 ipmi_msghandler tpm_tis tpm button nf_nat_ipv4 shpchp i2c_core nf_nat acpi_cpufreq nf_conntrack processor thermal_sys ip_tables x_tables autofs4 loop ext4 crc16 mbcache jbd2 nls_utf8 nls_cp437 vfat fat aufs(C) squashfs sg sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel ahci libahci libata ehci_pci ehci_hcd scsi_mod igb(O) usbcore usb_common dca ptp pps_core [last unloaded: linux_kernel_bde]
[  207.877421] CPU: 1 PID: 1555 Comm: syncd Tainted: G         C O  3.16.0-4-amd64 #1 Debian 3.16.36-1+deb8u2
[  207.896132] task: ffff880036dbe190 ti: ffff8802731bc000 task.ti: ffff8802731bc000
[  207.904518] RIP: 0010:[<ffffffff811a746f>]  [<ffffffff811a746f>] filp_close+0x1f/0x70
[  207.913283] RSP: 0018:ffff8802731bfce0  EFLAGS: 00010246
[  207.919232] RAX: ffffffffa03ff080 RBX: ffff88027349ff00 RCX: 0000000000000027
[  207.927227] RDX: ffff880272466858 RSI: ffff880272466800 RDI: ffff88027349ff00
[  207.935215] RBP: ffff880272466800 R08: ffff8802731bc000 R09: 000000000000b8fe
[  207.943211] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[  207.951207] R13: ffff880272466800 R14: 0000000000000001 R15: ffff880272466810
[  207.959205] FS:  00007f4388398740(0000) GS:ffff88027fc80000(0000) knlGS:0000000000000000
[  207.968274] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  207.974711] CR2: ffffffffa03ff0f0 CR3: 000000027317b000 CR4: 00000000001007e0
[  207.982708] Stack:
[  207.984956]  00000000000002ff 0000000000000027 0000000000000000 ffffffff811c5898
[  207.993262]  ffff880036dbe810 ffff880272a67400 ffff880272a67460 0000000000000005
[  208.001566]  ffff880036fe8760 ffff880036dbe190 ffffffff81069c0e ffff8802731bff58
[  208.009872] Call Trace:
[  208.012602]  [<ffffffff811c5898>] ? put_files_struct+0x78/0xc0
[  208.019142]  [<ffffffff81069c0e>] ? do_exit+0x28e/0xa70
[  208.024997]  [<ffffffff8106a469>] ? do_group_exit+0x39/0xa0
[  208.031241]  [<ffffffff81078928>] ? get_signal_to_deliver+0x1c8/0x5d0
[  208.038461]  [<ffffffff81013492>] ? do_signal+0x42/0xa10
[  208.044412]  [<ffffffff810779c4>] ? do_send_sig_info+0x54/0x70
[  208.050949]  [<ffffffff81013ed8>] ? do_notify_resume+0x78/0xa0
[  208.057488]  [<ffffffff81518a8a>] ? int_signal+0x12/0x17
[  208.063437] Code: 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 55 53 48 8b 47 38 48 89 fb 48 85 c0 74 44 48 8b 47 28 45 31 e4 48 89 f5 <48> 8b 40 70 48 85 c0 74 05 ff d0 41 89 c4 f6 43 45 40 75 16 48 
[  208.084896] RIP  [<ffffffff811a746f>] filp_close+0x1f/0x70
[  208.091039]  RSP <ffff8802731bfce0>
[  208.094942] CR2: ffffffffa03ff0f0
[  208.098652] ---[ end trace 7dcc5bf3c94d2437 ]---

Has anyone else encountered this issue?
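
For anyone triaging a trace like the one above, here is a minimal Python sketch that pulls out the fields most useful for a first look: the faulting address (CR2), the faulting function, the process name, and the last unloaded module. The helper name `summarize_oops` is hypothetical, and the module address range is an assumption based on the x86_64 memory map documented for kernels of this era, so adjust it for other kernels. A fault address inside that range, combined with a recently unloaded module, may indicate that a pointer into the unloaded module was still in use; this is a hint, not a diagnosis.

```python
import re

# Assumption: module mapping space on x86_64 kernels of this era
# (taken from the kernel's x86_64 memory-map documentation).
MODULES_START = 0xFFFFFFFFA0000000
MODULES_END = 0xFFFFFFFFFF5FFFFF

# Four lines copied verbatim from the trace above.
SAMPLE = """\
[  207.772520] BUG: unable to handle kernel paging request at ffffffffa03ff0f0
[  207.780338] IP: [<ffffffff811a746f>] filp_close+0x1f/0x70
[  207.877421] CPU: 1 PID: 1555 Comm: syncd Tainted: G         C O  3.16.0-4-amd64 #1 Debian 3.16.36-1+deb8u2
[  208.094942] CR2: ffffffffa03ff0f0
"""


def summarize_oops(text):
    """Extract a few fields from a kernel oops dump (hypothetical helper)."""
    cr2 = re.search(r"CR2: ([0-9a-f]{16})", text)
    ip = re.search(r"IP: \[<[0-9a-f]+>\] (\S+)", text)
    comm = re.search(r"Comm: (.+?) Tainted", text)
    unloaded = re.search(r"\[last unloaded: (\w+)\]", text)
    fault = int(cr2.group(1), 16) if cr2 else None
    return {
        "fault_addr": fault,
        "function": ip.group(1) if ip else None,
        "comm": comm.group(1) if comm else None,
        "last_unloaded": unloaded.group(1) if unloaded else None,
        # True when the faulting address falls inside the assumed module range.
        "in_module_space": fault is not None
        and MODULES_START <= fault <= MODULES_END,
    }


if __name__ == "__main__":
    for key, value in summarize_oops(SAMPLE).items():
        print(key, value)
```

Feeding the full console log to `summarize_oops` would also fill in `last_unloaded` from the "Modules linked in" line; the short sample here omits that line only to keep the example compact.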

@stcheng
Contributor

stcheng commented Mar 9, 2017

Thanks for submitting the issue. We will try to reproduce this.

@lguohan
Collaborator

lguohan commented Mar 18, 2017

Ack, I am seeing this issue as well. Will investigate.

@jleveque
Contributor

jleveque commented Mar 21, 2017

I am seeing this quite consistently.

Also seeing the following stack traces which may or may not be related:

[   71.821095] BUG: unable to handle kernel paging request at ffffffffa03a90d0
[   71.904645] IP: [<ffffffff811bbe1e>] do_vfs_ioctl+0x2be/0x4b0
[   71.973494] PGD 1816067 PUD 1817063 PMD 23393c067 PTE 0
[   72.036418] Oops: 0000 [#1] SMP 
[   72.075201] Modules linked in: sff_8436_eeprom lm75 ltc4215 max6620 jc42 emc1403 regmap_i2c dni_dps460 pmbus_core at24 w83627ehf hwmon_vid dell_s6000_platform(O) i2c_mux_gpio i2c_mux bridge stp llc ip6table_filter ip6_tables i2c_isch xt_conntrack gpio_sch ie6xx_wdt evdev iptable_filter ipt_MASQUERADE xt_addrtype iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 dcdbas tpm_tis coretemp nf_nat_ipv4 tpm nf_nat nf_conntrack kvm processor i2c_ismt lpc_sch mfd_core i2c_core shpchp ip_tables efi_pstore pcspkr x_tables button efivars autofs4 loop ext4 crc16 mbcache jbd2 nls_utf8 nls_cp437 vfat fat aufs(C) squashfs sg sd_mod crc_t10dif crct10dif_generic crct10dif_common ohci_pci e1000e ohci_hcd ptp ehci_pci pps_core ehci_hcd usbcore usb_common ahci libahci libata scsi_mod thermal thermal_sys [last unloaded: linux_kernel_bde]
[   72.944344] CPU: 0 PID: 1897 Comm: bcmL2MOD.0 Tainted: G         C O  3.16.0-4-amd64 #1 Debian 3.16.36-1+deb8u2
[   73.065180] Hardware name: Dell Inc S6000-ON (SI)/S6000 CPU, BIOS 4.6.5 06/16/2015
[   73.155770] task: ffff88023508f670 ti: ffff880235000000 task.ti: ffff880235000000
[   73.245410] RIP: 0010:[<ffffffff811bbe1e>]  [<ffffffff811bbe1e>] do_vfs_ioctl+0x2be/0x4b0
[   73.343372] RSP: 0018:ffff880235003ef0  EFLAGS: 00010a87
[   73.406914] RAX: ffffffffa03a9080 RBX: ffff8800bd08eb00 RCX: 00007facc00e4c70
[   73.492310] RDX: 0000000000004c16 RSI: 0000000000004c16 RDI: 0000000000000027
[   73.577693] RBP: ffff88023553ad58 R08: 0000000001c82fd0 R09: 00007facc5e8dc90
[   73.663068] R10: 00007facc00e4d00 R11: 0000000000000246 R12: 00007facc00e4c70
[   73.748462] R13: 0000000000004c16 R14: 00007facc00e4c70 R15: 0000000000000020
[   73.833847] FS:  00007facc00e5700(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
[   73.930663] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   73.999403] CR2: ffffffffa03a90d0 CR3: 000000023551b000 CR4: 00000000000007f0
[   74.084786] Stack:
[   74.108809]  0000000000000189 00007facc00e4d00 0000000001d27c30 0000000000000000
[   74.197726]  00000000ffffffff ffffffff810d56ee 00000000ffffffff 0000000000000002
[   74.286647]  ffff8800bd08eb01 ffff8800bd08eb00 0000000000000027 ffffffff811bc091
[   74.375567] Call Trace:
[   74.404806]  [<ffffffff810d56ee>] ? SyS_futex+0x6e/0x150
[   74.468345]  [<ffffffff811bc091>] ? SyS_ioctl+0x81/0xa0
[   74.530850]  [<ffffffff815187cd>] ? system_call_fast_compare_end+0x10/0x15
[   74.613105] Code: 00 00 48 89 
[   74.635108] RIP  [<ffffffff811bbe1e>] do_vfs_ioctl+0x2be/0x4b0
[   74.635119]  RSP <ffff880235003ef0>
[   74.635121] CR2: ffffffffa03a90d0
[   74.635126] ---[ end trace a689639e9ac2b807 ]---
[   79.828513] BUG: unable to handle kernel paging request at ffffffffa03ee5c2
[   79.830437] IP: [<ffffffffa03ee5c2>] 0xffffffffa03ee5c2
[   79.830442] PGD 1816067 PUD 1817063 PMD 23393c067 PTE 0
[   79.830446] Oops: 0010 [#4] SMP 
[   79.830527] Modules linked in: sff_8436_eeprom lm75 ltc4215 max6620 jc42 emc1403 regmap_i2c dni_dps460 pmbus_core at24 w83627ehf hwmon_vid dell_s6000_platform(O) i2c_mux_gpio i2c_mux bridge stp llc ip6table_filter ip6_tables i2c_isch xt_conntrack gpio_sch ie6xx_wdt evdev iptable_filter ipt_MASQUERADE xt_addrtype iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 dcdbas tpm_tis coretemp nf_nat_ipv4 tpm nf_nat nf_conntrack kvm processor i2c_ismt lpc_sch mfd_core i2c_core shpchp ip_tables efi_pstore pcspkr x_tables button efivars autofs4 loop ext4 crc16 mbcache jbd2 nls_utf8 nls_cp437 vfat fat aufs(C) squashfs sg sd_mod crc_t10dif crct10dif_generic crct10dif_common ohci_pci e1000e ohci_hcd ptp ehci_pci pps_core ehci_hcd usbcore usb_common ahci libahci libata scsi_mod thermal thermal_sys [last unloaded: linux_kernel_bde]
[   79.830533] CPU: 2 PID: 1852 Comm: SOC KNET RX Tainted: G      D  C O  3.16.0-4-amd64 #1 Debian 3.16.36-1+deb8u2
[   79.830535] Hardware name: Dell Inc S6000-ON (SI)/S6000 CPU, BIOS 4.6.5 06/16/2015
[   79.830538] task: ffff8802350695b0 ti: ffff8800bd804000 task.ti: ffff8800bd804000
[   79.830547] RIP: 0010:[<ffffffffa03ee5c2>]  [<ffffffffa03ee5c2>] 0xffffffffa03ee5c2
[   79.830549] RSP: 0018:ffff8800bd807b90  EFLAGS: 00010282
[   79.830551] RAX: 0000000000000004 RBX: 0000000000000000 RCX: 00000000c0000100
[   79.830554] RDX: ffff8800bd807fd8 RSI: ffff8802350695b0 RDI: ffff88023fd12f40
[   79.830556] RBP: 0000000000000000 R08: ffff8800bd804000 R09: 000000000000000e
[   79.830558] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[   79.830561] R13: ffff8800bd807c58 R14: ffffffffa03f7c00 R15: 0000000000000010
[   79.830564] FS:  00007facb9d9c700(0000) GS:ffff88023fd00000(0000) knlGS:0000000000000000
[   79.830567] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   79.830569] CR2: ffffffffa03ee5c2 CR3: 000000023551b000 CR4: 00000000000007e0
[   79.830571] Stack:
[   79.830576]  0000000000000000 00000001bd663dcc 0000000000000003 ffff880200000000
[   79.830582]  ffff8802350695b0 ffffffff810a8570 ffffffffa03f7c08 ffffffffa03f7c08
[   79.830587]  0000000000000034 ffff8800bd807c58 ffff880235695d58 00007facb9d9bae0
[   79.830588] Call Trace:
[   79.830594]  [<ffffffff810a8570>] ? prepare_to_wait_event+0xf0/0xf0
[   79.830613]  [<ffffffff813c71a3>] ? loopback_xmit+0x63/0xb0
[   79.830620]  [<ffffffff81424767>] ? dev_hard_start_xmit+0x2e7/0x610
[   79.830633]  [<ffffffff811bbe2f>] ? do_vfs_ioctl+0x2cf/0x4b0
[   79.830638]  [<ffffffff810d56ee>] ? SyS_futex+0x6e/0x150
[   79.830643]  [<ffffffff811bc091>] ? SyS_ioctl+0x81/0xa0
[   79.830649]  [<ffffffff815187cd>] ? system_call_fast_compare_end+0x10/0x15
[   79.830656] Code:  Bad RIP value.
[   79.830664] RIP  [<ffffffffa03ee5c2>] 0xffffffffa03ee5c2
[   79.830665]  RSP <ffff8800bd807b90>
[   79.830667] CR2: ffffffffa03ee5c2
[   79.830671] ---[ end trace a689639e9ac2b80a ]---
[   79.839594] Fixing recursive fault but reboot is needed!

@stcheng
Contributor

stcheng commented Mar 22, 2017

@marian-pritsak Do you see this error?

@stcheng
Contributor

stcheng commented Mar 22, 2017

Is this related to #425?

@pavel-shirshov
Contributor

I don't think so. I think it's related to the brcm platform only, and it requires reboot by power.

@jleveque
Copy link
Contributor

jleveque commented Mar 23, 2017

@stcheng: I was also wondering if they might be related.

@pavel-shirshov: What do you mean by "reboot by power"? Are you saying the switch locks up and requires a power cycle? If so, which bug are you referring to? I can safely state that I see both of these problems on my dev switch upon reboot, and the machine will eventually reboot without requiring a power cycle. A couple of weeks ago I occasionally needed to power cycle the switch after a reboot, but that seems to have subsided.

@pavel-shirshov
Contributor

Joe,
Yes, the switch locks up. It doesn't respond on its console, and I need to use a power reset to recover it from this state. I'm referring to this #389 (comment) stack trace. When I get such a stack trace, my development switch locks up.

@stcheng
Contributor

stcheng commented Mar 24, 2017

We need to debug the syncd docker. It seems to be a driver-related problem.
When I removed the syncd docker manually before rebooting, the issue did not occur.
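
The workaround described above could be scripted roughly as follows. This is only a sketch: the thread does not say which command was used to remove the container, so `docker rm -f syncd` is an assumption, and the helper names `pre_reboot_commands` and `run_all` are hypothetical. The default is a dry run that only prints the commands.

```python
import subprocess


def pre_reboot_commands(container="syncd"):
    """Commands to run in order: remove the container, then reboot."""
    return [
        # Assumption: "remove manually" is taken to mean a forced removal.
        ["docker", "rm", "-f", container],
        ["reboot"],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    shown = []
    for cmd in commands:
        line = " ".join(cmd)
        shown.append(line)
        print(line)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return shown


if __name__ == "__main__":
    # Dry run: prints the two commands without executing them.
    run_all(pre_reboot_commands())
```

Passing `dry_run=False` would actually remove the container and reboot the switch, so it should only be used on a device where that is acceptable.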

@jleveque
Contributor

@kaiyu22: It seems as though we have fixed this with PR #434. Can you please build and test an image including this PR and report back?

@kaiyu22
Contributor Author

kaiyu22 commented Mar 28, 2017

Sure, I will report later.

@kaiyu22
Contributor Author

kaiyu22 commented Mar 29, 2017

@jleveque: After checking, I can confirm that an image including PR #434 fixes this reboot issue.

@stcheng
Contributor

stcheng commented Mar 29, 2017

@kaiyu22 Thanks. I will close this issue.

@stcheng stcheng closed this as completed Mar 29, 2017