Kernel GP during swss/syncd/teamd shutdown #6103

Closed
yxieca opened this issue Dec 2, 2020 · 4 comments · Fixed by sonic-net/sonic-swss-common#444
Assignees
Labels
Master Branch Quality P0 Priority of the issue

Comments

@yxieca
Contributor

yxieca commented Dec 2, 2020

Description

Steps to reproduce the issue:

  1. run autorestart test for teamd, swss, or syncd

Describe the results you received:
The test found the following error in syslog. The system recovered from the test, and the general protection (GP) fault happened during service shutdown, but the issue should be addressed anyway.

INFO kernel: [34985.337998] traps: gearsyncd[16050] general protection ip:7fe030bb1aea sp:7fe03062f9b0 error:0 in libswsscommon.so.0.0.0[7fe030b95000+51000]\n

Describe the results you expected:
test pass

Additional information you deem important (e.g. issue happens only occasionally):

**Output of `show version`:**

```

SONiC Software Version: SONiC.master.507-7f21c0be
Distribution: Debian 10.6
Kernel: 4.19.0-9-2-amd64
Build commit: 7f21c0b
Build date: Sat Nov 28 05:20:26 UTC 2020
Built by: johnar@jenkins-worker-8
```

@yxieca
Contributor Author

yxieca commented Dec 2, 2020

This issue is not easily repeatable. I have only seen it once in the available history.

@yxieca
Contributor Author

yxieca commented Dec 2, 2020

(gdb) bt
#0 0x00007fe030bb1aea in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, swss::SonicDBInfo, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, swss::SonicDBInfo> > > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, swss::SonicDBInfo, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, swss::SonicDBInfo> > > > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node (
this=this@entry=0x7fe030bfe3c0 swss::SonicDBConfig::m_db_info[abi:cxx11], __n=0, __k="", __code=6142509188972423790) at /usr/include/c++/8/bits/hashtable.h:1554
#1 0x00007fe030ba9ad1 in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, swss::SonicDBInfo, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, swss::SonicDBInfo> > > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, swss::SonicDBInfo, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, swss::SonicDBInfo> > > > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_node (__c=, __key="",
__bkt=, this=0x7fe030bfe3c0 swss::SonicDBConfig::m_db_info[abi:cxx11]) at /usr/include/c++/8/bits/hashtable.h:651
#2 std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, swss::SonicDBInfo, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, swss::SonicDBInfo> > > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, swss::SonicDBInfo, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, swss::SonicDBInfo> > > > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::find (__k="", this=0x7fe030bfe3c0 swss::SonicDBConfig::m_db_info[abi:cxx11])
at /usr/include/c++/8/bits/hashtable.h:1441
#3 std::unordered_map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, swss::SonicDBInfo, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, swss::SonicDBInfo> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, swss::SonicDBInfo, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, swss::SonicDBInfo> > > > > >::find (__x="", this=0x7fe030bfe3c0 swss::SonicDBConfig::m_db_info[abi:cxx11]) at /usr/include/c++/8/bits/unordered_map.h:921
#4 swss::SonicDBConfig::getDbInfo (dbName="LOGLEVEL_DB", netns="") at dbconnector.cpp:220
#5 0x00007fe030ba9da9 in swss::SonicDBConfig::getDbId (dbName="LOGLEVEL_DB", netns="") at dbconnector.cpp:277
#6 0x00007fe030baa1ee in swss::DBConnector::DBConnector (this=0x7fe03062fe20, dbName="LOGLEVEL_DB", timeout=0, isTcpConn=, netns="") at dbconnector.cpp:533
#7 0x00007fe030baa38a in swss::DBConnector::DBConnector (this=, dbName=..., timeout=, isTcpConn=)
at /usr/include/c++/8/bits/char_traits.h:287
#8 0x00007fe030ba0f33 in swss::Logger::settingThread (this=0x7fe030bfe280 swss::Logger::getInstance()::m_logger) at /usr/include/c++/8/ext/new_allocator.h:79
#9 0x00007fe030a8cb2f in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007fe030657fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#11 0x00007fe03076a4cf in clone () from /lib/x86_64-linux-gnu/libc.so.6

@yxieca
Contributor Author

yxieca commented Dec 2, 2020

(gdb) f 4
#4 swss::SonicDBConfig::getDbInfo (dbName="LOGLEVEL_DB", netns="") at dbconnector.cpp:220
220 dbconnector.cpp: No such file or directory.
(gdb) info locals
logger__LINE__ = {m_line = 208,
m_fun = 0x7fe030be72f0 <swss::SonicDBConfig::getDbInfo(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::__FUNCTION__> "getDbInfo"}
__FUNCTION__ = "getDbInfo"
foundNetns =
infos =
foundDb =
(gdb) p m_db_info
$1 = std::unordered_map with 0 elements
(gdb) p netns
$2 = ""
(gdb) p m_init
$3 = true

@yxieca
Contributor Author

yxieca commented Dec 2, 2020

A few things are interesting:

  1. There is no db information in m_db_info (an unordered_map), which is consistent with swss being shut down.
  2. netns "" is EMPTY_NAMESPACE.

Looking up an empty map should yield nothing instead of faulting. I suspect this is a multi-thread issue where a thread made a query while the global variable m_db_info was changing. I did some research, and it is believed that STL containers are multi-thread safe only for concurrent reads, not for writes.
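To make the suspicion above concrete, here is a minimal sketch of how a shared `std::unordered_map` can be guarded so that a lookup never overlaps a modification. `DbTable`, `stress_lookup`, and the value `3` are hypothetical names and data chosen for illustration; this is not the swss-common API, and it does not help once the map object itself has been destroyed.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a process-wide DB table (assumption, not swss code).
class DbTable {
public:
    // Every access takes the same mutex, so a lookup never sees a half-updated map.
    void set(const std::string& name, int id) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_ids[name] = id;
    }

    void clear() {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_ids.clear();
    }

    // Returns true and fills `id` when the name is present.
    bool find(const std::string& name, int& id) const {
        std::lock_guard<std::mutex> lock(m_mutex);
        auto it = m_ids.find(name);
        if (it == m_ids.end()) return false;
        id = it->second;
        return true;
    }

private:
    mutable std::mutex m_mutex;
    std::unordered_map<std::string, int> m_ids;
};

// Runs four readers against one writer that keeps filling and clearing the table.
// Returns how many lookups reported a value other than the one written.
std::size_t stress_lookup(int rounds) {
    DbTable table;
    std::atomic<std::size_t> bad(0);

    std::thread writer([&] {
        for (int i = 0; i < rounds; ++i) {
            table.set("LOGLEVEL_DB", 3);  // 3 is an arbitrary illustrative id
            table.clear();
        }
    });

    std::vector<std::thread> readers;
    for (int r = 0; r < 4; ++r) {
        readers.emplace_back([&] {
            for (int i = 0; i < rounds; ++i) {
                int id = 0;
                if (table.find("LOGLEVEL_DB", id) && id != 3) ++bad;
            }
        });
    }

    writer.join();
    for (auto& t : readers) t.join();
    return bad.load();
}
```

A lookup on the empty table simply returns `false`, which is the behavior expected in the comment above. Without the mutex, the same reader and writer pattern would be a data race and therefore undefined behavior.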

@daall daall added the P0 Priority of the issue label Dec 23, 2020
@daall daall assigned vaibhavhd and unassigned lguohan Dec 23, 2020
vaibhavhd added a commit to sonic-net/sonic-swss-common that referenced this issue Jan 12, 2021
Fixes: sonic-net/sonic-buildimage#6103
This PR is to fix the Kernel GP errors that are seen in any short-lived process within swss.
Issue:
As part of *syncd initialization, the Logger::linkToDbNative function is called and a thread is started. When the main *syncd process terminates, the destructor Logger::~Logger simply detaches the SettingThread. The exiting main process then destroys the static variables. The fault is hit when the detached thread (still executing in its infinite loop) tries to access these freed variables.
Fix:
Before exiting the main process, set a flag in the Logger destructor to signal the thread that the main process is finishing up. Instead of detaching the thread (which leaves it free to access the destroyed static variables), join the SettingThread.
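The shape of this fix can be sketched as follows. This is an illustration under assumptions, not the actual swss::Logger change: `SettingWorker`, `run_and_shutdown`, the tick counter, and the sleep intervals are all hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical worker that mimics the flag-and-join pattern (not swss::Logger).
class SettingWorker {
public:
    explicit SettingWorker(std::atomic<int>& ticks)
        : m_ticks(ticks), m_stop(false), m_thread(&SettingWorker::run, this) {}

    ~SettingWorker() {
        // Signal the loop to finish, then wait for it. After join() returns,
        // the thread can no longer touch any state the owner is about to destroy.
        m_stop.store(true);
        if (m_thread.joinable()) m_thread.join();
    }

    SettingWorker(const SettingWorker&) = delete;
    SettingWorker& operator=(const SettingWorker&) = delete;

private:
    void run() {
        // Stands in for the infinite settings loop; it polls the stop flag.
        while (!m_stop.load()) {
            ++m_ticks;
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }

    // Declaration order matters: m_thread is last so the other members are
    // initialized before the thread starts running.
    std::atomic<int>& m_ticks;
    std::atomic<bool> m_stop;
    std::thread m_thread;
};

// Returns how many ticks were recorded after the worker was destroyed.
int run_and_shutdown() {
    std::atomic<int> ticks(0);
    {
        SettingWorker worker(ticks);
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
    }  // destructor sets the flag and joins here
    const int after_join = ticks.load();
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
    return ticks.load() - after_join;
}
```

Because the destructor joins, `run_and_shutdown()` returns 0: no work happens once the owner is gone. With `detach()` in the destructor instead, the loop could keep running while `ticks` goes out of scope, which is the same class of fault as the one reported here.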