Gearsyncd crash and core observed after continuous warm reboot #6172

Closed
vaibhavhd opened this issue Dec 9, 2020 · 5 comments
@vaibhavhd
Contributor

vaibhavhd commented Dec 9, 2020

Description

gearsyncd crashed after a warm reboot and produced a core file.
Core was generated by `/usr/bin/gearsyncd -p /usr/share/sonic/hwsku/gearbox_config.json'.

The crash appears to have been triggered by a general protection (GP) fault that the kernel reported during gearsyncd warm start.

Dec 9 05:28:18.287552 vlab-01 INFO kernel: [ 41.785625] traps: gearsyncd[2570] general protection ip:7f385ce98fea sp:7f385c916690 error:0 in libswsscommon.so.0.0.0[7f385ce7c000+54000]
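
The `ip` and `[base+size]` fields in the trap line can be turned into an offset inside the mapped region of `libswsscommon.so.0.0.0`. The following is a minimal sketch of that arithmetic; the helper names `faultOffset` and `insideMapping` are made up for illustration. Note that the result is an offset from the start of the mapping named in the trap line, which is not necessarily the file offset of the instruction in the library.

```cpp
#include <cstdint>

// Values copied from the kernel "traps:" line above.
constexpr std::uint64_t kFaultIp     = 0x7f385ce98feaULL; // ip:
constexpr std::uint64_t kMappingBase = 0x7f385ce7c000ULL; // [base+size], base part
constexpr std::uint64_t kMappingSize = 0x54000ULL;        // [base+size], size part

// Offset of the faulting instruction from the start of the mapping.
// (Hypothetical helper, for illustration only.)
constexpr std::uint64_t faultOffset(std::uint64_t ip, std::uint64_t base) {
    return ip - base;
}

// True when ip lies inside the half-open range [base, base + size).
// (Hypothetical helper, for illustration only.)
constexpr bool insideMapping(std::uint64_t ip, std::uint64_t base, std::uint64_t size) {
    return ip >= base && ip - base < size;
}
```

With the values from this log, `faultOffset(kFaultIp, kMappingBase)` is `0x1cfea`, which is below the mapping size of `0x54000`, so the fault is inside the library's mapped region. The same address, `0x00007f385ce98fea`, is frame #0 of the backtrace posted later in this thread.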

Steps to reproduce the issue:

  1. Run the continuous_warm_reboot test on a VS image (the issue should be reproducible on a physical device too).
  2. The issue reproduces within 30 iterations.
  3. A core is produced (which fails the test) and the gearsyncd crash appears in syslog.

Describe the results you received:
A GP fault occurred in the gearsyncd processing path, producing a gearsyncd core.

Dec  9 05:28:12.176008 vlab-01 NOTICE root: Starting syncd service...
Dec  9 05:28:12.186609 vlab-01 NOTICE root: Starting gbsyncd service...
Dec  9 05:28:12.191828 vlab-01 NOTICE root: Locking /tmp/swss-gbsyncd-lock from gbsyncd service
Dec  9 05:28:12.197144 vlab-01 NOTICE root: Locking /tmp/swss-syncd-lock from syncd service
Dec  9 05:28:12.215922 vlab-01 NOTICE root: Locked /tmp/swss-gbsyncd-lock (10) from gbsyncd service
Dec  9 05:28:12.225194 vlab-01 NOTICE root: Locked /tmp/swss-syncd-lock (10) from syncd service
Dec  9 05:28:12.912571 vlab-01 NOTICE root: Warm boot flag: gbsyncd true.
Dec  9 05:28:12.920948 vlab-01 NOTICE root: Warm boot flag: syncd true.

Dec  9 05:28:13.510244 vlab-01 INFO syncd.sh[1474]: syncd
Dec  9 05:28:13.516518 vlab-01 NOTICE root: Started syncd service...
Dec  9 05:28:13.535676 vlab-01 NOTICE root: Unlocking /tmp/swss-syncd-lock (10) from syncd service
Dec  9 05:28:13.551573 vlab-01 INFO systemd[1]: Started syncd service.
Dec  9 05:28:13.643762 vlab-01 INFO gbsyncd.sh[1475]: gbsyncd
Dec  9 05:28:13.650119 vlab-01 NOTICE root: Started gbsyncd service...
Dec  9 05:28:13.662793 vlab-01 NOTICE root: Unlocking /tmp/swss-gbsyncd-lock (10) from gbsyncd service
Dec  9 05:28:13.688510 vlab-01 INFO systemd[1]: Started gbsyncd service.

Dec  9 05:28:18.286066 vlab-01 NOTICE swss#gearsyncd: :- checkWarmStart: gearsyncd doing warm start, restore count 62
Dec  9 05:28:18.287552 vlab-01 INFO kernel: [   41.785625] traps: gearsyncd[2570] general protection ip:7f385ce98fea sp:7f385c916690 error:0 in libswsscommon.so.0.0.0[7f385ce7c000+54000]
Dec  9 05:28:25.441664 vlab-01 INFO swss#supervisord 2020-12-09 05:28:18,240 INFO spawned: 'gearsyncd' with pid 36
Dec  9 05:28:25.441664 vlab-01 INFO swss#supervisord 2020-12-09 05:28:18,240 INFO success: gearsyncd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
Dec  9 05:28:25.441664 vlab-01 INFO swss#supervisord 2020-12-09 05:28:18,425 INFO exited: gearsyncd (terminated by SIGSEGV (core dumped); not expected)

Describe the results you expected:
Warm reboot without any core/crash.

Additional information you deem important (e.g. issue happens only occasionally):

**Output of `show version`:**
admin@vlab-01:~$ show ver

SONiC Software Version: SONiC.master.580-43a32e60
Distribution: Debian 10.7
Kernel: 4.19.0-9-2-amd64
Build commit: 43a32e60
Build date: Tue Dec  8 09:18:44 UTC 2020
Built by: johnar@jenkins-worker-11

Platform: x86_64-kvm_x86_64-r0
HwSKU: Force10-S6000
ASIC: vs
ASIC Count: 1
Serial Number: 000000

**Attach debug file `sudo generate_dump`:**

```
(paste your output here)
```
@vaibhavhd
Contributor Author

Backtrace from the core:

(gdb) bt
#0  0x00007f385ce98fea in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, swss::SonicDBInfo, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, swss::SonicDBInfo> > > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, swss::SonicDBInfo, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, swss::SonicDBInfo> > > > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node (this=this@entry=0x7f385cee83e0 <swss::SonicDBConfig::m_db_info[abi:cxx11]>, __n=0, __k="", __code=6142509188972423790)
    at /usr/include/c++/8/bits/hashtable.h:1554
#1  0x00007f385ce90fd1 in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, swss::SonicDBInfo, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, swss::SonicDBInfo> > > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, swss::SonicDBInfo, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, swss::SonicDBInfo> > > > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_node (__c=<optimized out>, __key="", __bkt=<optimized out>, this=0x7f385cee83e0 <swss::SonicDBConfig::m_db_info[abi:cxx11]>)
    at /usr/include/c++/8/bits/hashtable.h:651
#2  std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, swss::SonicDBInfo, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, swss::SonicDBInfo> > > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, swss::SonicDBInfo, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, swss::SonicDBInfo> > > > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::find (__k="", this=0x7f385cee83e0 <swss::SonicDBConfig::m_db_info[abi:cxx11]>) at /usr/include/c++/8/bits/hashtable.h:1441
#3  std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, swss::SonicDBInfo, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, swss::SonicDBInfo> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, swss::SonicDBInfo, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, swss::SonicDBInfo> > > > > >::find (__x="", 
    this=0x7f385cee83e0 <swss::SonicDBConfig::m_db_info[abi:cxx11]>) at /usr/include/c++/8/bits/unordered_map.h:921
#4  swss::SonicDBConfig::getDbInfo (dbName="LOGLEVEL_DB", netns="") at dbconnector.cpp:220
#5  0x00007f385ce912cf in swss::SonicDBConfig::getSeparator (dbName="LOGLEVEL_DB", netns="") at dbconnector.cpp:282
#6  0x00007f385ce91d2a in swss::SonicDBConfig::getSeparator[abi:cxx11](swss::DBConnector const*) (db=db@entry=0x7f385c916e20) at dbconnector.cpp:330
#7  0x00007f385cebc7f0 in swss::ConsumerTableBase::ConsumerTableBase (this=0x7f3858003d80, db=0x7f385c916e20, tableName="", popBatchSize=128, pri=0) at consumertablebase.cpp:5
#8  0x00007f385cebd10a in swss::ConsumerStateTable::ConsumerStateTable (this=0x7f3858003d80, db=0x7f385c916e20, tableName="", popBatchSize=<optimized out>, pri=<optimized out>)
    at consumerstatetable.cpp:14
#9  0x00007f385ce88dab in __gnu_cxx::new_allocator<swss::ConsumerStateTable>::construct<swss::ConsumerStateTable, swss::DBConnector*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&> (this=<optimized out>, __p=0x7f3858003d80) at /usr/include/c++/8/new:169
#10 std::allocator_traits<std::allocator<swss::ConsumerStateTable> >::construct<swss::ConsumerStateTable, swss::DBConnector*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&> (__a=..., __p=0x7f3858003d80) at /usr/include/c++/8/bits/alloc_traits.h:475
#11 std::_Sp_counted_ptr_inplace<swss::ConsumerStateTable, std::allocator<swss::ConsumerStateTable>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<swss::DBConnector*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&> (__a=..., this=0x7f3858003d70) at /usr/include/c++/8/bits/shared_ptr_base.h:545
#12 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<swss::ConsumerStateTable, std::allocator<swss::ConsumerStateTable>, swss::DBConnector*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&> (__a=..., __p=<synthetic pointer>: <optimized out>, this=<synthetic pointer>) at /usr/include/c++/8/bits/shared_ptr_base.h:677
#13 std::__shared_ptr<swss::ConsumerStateTable, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<swss::ConsumerStateTable>, swss::DBConnector*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&> (__tag=..., this=<synthetic pointer>) at /usr/include/c++/8/bits/shared_ptr_base.h:1342
#14 std::shared_ptr<swss::ConsumerStateTable>::shared_ptr<std::allocator<swss::ConsumerStateTable>, swss::DBConnector*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&> (__tag=..., this=<synthetic pointer>) at /usr/include/c++/8/bits/shared_ptr.h:359
#15 std::allocate_shared<swss::ConsumerStateTable, std::allocator<swss::ConsumerStateTable>, swss::DBConnector*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&> (__a=...) at /usr/include/c++/8/bits/shared_ptr.h:706
#16 std::make_shared<swss::ConsumerStateTable, swss::DBConnector*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&> ()
    at /usr/include/c++/8/bits/shared_ptr.h:722
#17 swss::Logger::settingThread (this=0x7f385cee82a0 <swss::Logger::getInstance()::m_logger>) at logger.cpp:182
#18 0x00007f385cd73b2f in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#19 0x00007f385c93efa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#20 0x00007f385ca514cf in clone () from /lib/x86_64-linux-gnu/libc.so.6

@vaibhavhd
Contributor Author

Although the stack trace is different, the root cause of this issue appears to be identical to #6103.

Here too, the fault is hit when getDbInfo calls m_db_info.find(netns) with an empty netns.

This is where it resembles the other issue: EMPTY_NAMESPACE is looked up in an empty unordered_map.

https://github.com/Azure/sonic-swss-common/blob/cf9cc37547bcbec8ac32c3f651335dfdb4b0f635/common/dbconnector.cpp#L220

(gdb) f 4
#4  swss::SonicDBConfig::getDbInfo (dbName="LOGLEVEL_DB", netns="") at dbconnector.cpp:220
220     dbconnector.cpp: No such file or directory.
(gdb) info locals
logger__LINE__ = {m_line = 208, 
  m_fun = 0x7f385ced12f0 <swss::SonicDBConfig::getDbInfo(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::__FUNCTION__> "getDbInfo"}
__FUNCTION__ = "getDbInfo"
foundNetns = <optimized out>
infos = <optimized out>
foundDb = <optimized out>
(gdb) p m_db_info
$1 = std::unordered_map with 0 elements
(gdb) p netns
$2 = ""
(gdb) p m_init
$3 = true
(gdb) 
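
For context, a lookup in an empty `std::unordered_map` is well defined in single-threaded code: `find` simply returns `end()`. So an empty `m_db_info` would not by itself explain a GP fault, which suggests the container was being modified while it was being read. The sketch below shows the two-level lookup pattern in isolation. `DbInfoSketch` and `lookup` are hypothetical stand-ins, not the swss-common code; only the nested map shape and the local names `foundNetns` and `foundDb` are taken from the gdb output above.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical stand-in for swss::SonicDBInfo; its real fields are not shown in this issue.
struct DbInfoSketch {
    int dbId;
    std::string separator;
};

// Same nesting as m_db_info in the backtrace: netns -> (dbName -> info).
using DbMap = std::unordered_map<std::string, DbInfoSketch>;
using NetnsMap = std::unordered_map<std::string, DbMap>;

// Returns a pointer to the entry, or nullptr when either level is missing.
// (Hypothetical helper, for illustration only.)
const DbInfoSketch* lookup(const NetnsMap& infoByNetns,
                           const std::string& netns,
                           const std::string& dbName) {
    auto foundNetns = infoByNetns.find(netns);
    if (foundNetns == infoByNetns.end()) {
        return nullptr; // unknown namespace, including the empty-map case
    }
    auto foundDb = foundNetns->second.find(dbName);
    if (foundDb == foundNetns->second.end()) {
        return nullptr; // namespace known, database name unknown
    }
    return &foundDb->second;
}
```

Called on a default-constructed `NetnsMap` with `netns` set to `""`, `lookup` returns `nullptr` rather than faulting.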

@vaibhavhd
Contributor Author

This is most likely a multithreading race condition: the crash is intermittent and is seen only about once in ~30 iterations of gearsyncd warm start.
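
One common way to rule out this kind of race is to serialize every read and write of the shared map behind a mutex. The following is a minimal sketch under that assumption, not the fix applied in swss-common or in #6103. `GuardedRegistry`, `load`, `tryGet` and `runDemo` are hypothetical names, and the integer values are arbitrary.

```cpp
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical registry, for illustration only: all access goes through one mutex.
class GuardedRegistry {
public:
    // Writer: replaces the whole map while holding the lock.
    void load(const std::unordered_map<std::string, int>& entries) {
        std::lock_guard<std::mutex> guard(m_mutex);
        m_entries = entries;
    }

    // Reader: returns true and fills 'value' only when the key is present.
    bool tryGet(const std::string& key, int& value) const {
        std::lock_guard<std::mutex> guard(m_mutex);
        auto found = m_entries.find(key);
        if (found == m_entries.end()) {
            return false;
        }
        value = found->second;
        return true;
    }

private:
    mutable std::mutex m_mutex;
    std::unordered_map<std::string, int> m_entries;
};

// Runs one writer and four readers concurrently, then returns the last stored value.
int runDemo() {
    GuardedRegistry registry;
    std::thread writer([&registry] {
        for (int i = 0; i < 1000; ++i) {
            registry.load({{"LOGLEVEL_DB", i}});
        }
    });
    std::vector<std::thread> readers;
    for (int r = 0; r < 4; ++r) {
        readers.emplace_back([&registry] {
            int value = 0;
            for (int i = 0; i < 1000; ++i) {
                registry.tryGet("LOGLEVEL_DB", value); // may miss before the first load
            }
        });
    }
    writer.join();
    for (auto& reader : readers) {
        reader.join();
    }
    int last = -1;
    registry.tryGet("LOGLEVEL_DB", last);
    return last;
}
```

Without the lock, the readers' `find` calls could run while `load` is rebuilding the map, which is undefined behavior and could surface as an intermittent fault like the one reported here.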

@vaibhavhd
Contributor Author

Duplicate of #6103

vaibhavhd marked this as a duplicate of #6103 on Dec 23, 2020
@daall
Contributor

daall commented Dec 23, 2020

Closing as a duplicate.

daall closed this as completed on Dec 23, 2020