Manticore should not rely on DNS when running in kubernetes. #6

Open
zerthimon opened this issue Aug 12, 2021 · 2 comments
Labels
waiting Waiting for the original poster (in most cases) or something else

Comments

@zerthimon

zerthimon commented Aug 12, 2021

Manticore should not rely on DNS to resolve cluster nodes when running in Kubernetes.

When a pod restarts because of a node failure, upgrade, etc., its entry is removed from the Kubernetes DNS (CoreDNS). Manticore then crashes on the remaining node with an error, because it cannot resolve the failed node by name:

[Thu Aug 12 12:18:54.614 2021] [13] FATAL: no AF_INET address found for: backend-manticoresearch-worker-1.backend-manticoresearch-worker  
FATAL: no AF_INET address found for: backend-manticoresearch-worker-1.backend-manticoresearch-worker  
[Thu Aug 12 12:18:54.673 2021] [1] caught SIGTERM, shutting down  
caught SIGTERM, shutting down  
------- FATAL: CRASH DUMP -------  
[Thu Aug 12 12:18:54.673 2021] [    1]  
[Thu Aug 12 12:19:19.674 2021] [1] WARNING: GlobalCrashQueryGetRef: thread-local info is not set! Use ad-hoc  
WARNING: GlobalCrashQueryGetRef: thread-local info is not set! Use ad-hoc  
  
--- crashed invalid query ---  
  
--- request dump end ---  
--- local index:  
Manticore 3.6.0 96d61d8bf@210504 release  
Handling signal 11  
Crash!!! Handling signal 11  
-------------- backtrace begins here ---------------  
Program compiled with 7  
Configured with flags: Configured by CMake with these definitions: -DCMAKE_BUILD_TYPE=RelWithDebInfo -DDISTR_BUILD=bionic -DUSE_SSL=ON -DDL_UNIXODBC=1 -DUNIXODBC_LIB=libodbc.so.2 -DDL_EXPAT=1 -DEXPAT_LIB=libexpat.so.1 -DUSE_LIBICONV=1 -DDL_MYSQL=1 -DMYSQL_LIB=libmysqlclient.so.20 -DDL_PGSQL=1 -DPGSQL_LIB=libpq.so.5 -DLOCALDATADIR=/var/data -DFULL_SHARE_DIR=/usr/share/manticore -DUSE_RE2=1 -DUSE_ICU=1 -DUSE_BISON=ON -DUSE_FLEX=ON -DUSE_SYSLOG=1 -DWITH_EXPAT=1 -DWITH_ICONV=ON -DWITH_MYSQL=1 -DWITH_ODBC=ON -DWITH_PGSQL=1 -DWITH_RE2=1 -DWITH_STEMMER=1 -DWITH_ZLIB=ON -DGALERA_SONAME=libgalera_manticore.so.31 -DSYSCONFDIR=/etc/manticoresearch  
Host OS is Linux x86_64  
Stack bottom = 0x7fff43fad227, thread stack size = 0x20000  
Trying manual backtrace:  
Something wrong with thread stack, manual backtrace may be incorrect (fp=0x5c95bbd0f9002)  
Wrong stack limit or frame pointer, manual backtrace failed (fp=0x5c95bbd0f9002, stack=0x564947870000, stacksize=0x20000)  
Trying system backtrace:  
begin of system symbols:  
searchd(_Z12sphBacktraceib+0xcb)[0x564946faf75b]  
searchd(_ZN11CrashLogger11HandleCrashEi+0x1ac)[0x564946dcd66c]  
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7fa45a7fa890]  
searchd(_ZN11CSphNetLoop11StopNetLoopEv+0xa)[0x564946eb978a]  
searchd(_Z8Shutdownv+0xd0)[0x564946dd2c00]  
searchd(_Z12CheckSignalsv+0x63)[0x564946de04a3]  
searchd(_Z8TickHeadv+0x1b)[0x564946de04fb]  
searchd(_Z11ServiceMainiPPc+0x1cea)[0x564946dfa5ea]  
searchd(main+0x63)[0x564946dcb6a3]  
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7fa4594b4b97]  
searchd(_start+0x2a)[0x564946dcca6a]  
-------------- backtrace ends here ---------------  
Please, create a bug report in our bug tracker (https://github.com/manticoresoftware/manticore/issues)  
and attach there:  
a) searchd log, b) searchd binary, c) searchd symbols.  
Look into the chapter 'Reporting bugs' in the manual  
(https://manual.manticoresearch.com/Reporting_bugs)  
Dump with GDB is not available  
--- BT to source lines (depth 11): ---  
conversion failed (error 'No such file or directory'):  
  1. Run the command provided below over the crashed binary (for example, 'searchd'):  
  2. Attach the source.txt to the bug report.  
addr2line -e searchd 0x46faf75b 0x46dcd66c 0x5a7fa890 0x46eb978a 0x46dd2c00 0x46de04a3 0x46de04fb   
0x46dfa5ea 0x46dcb6a3 0x594b4b97 0x46dcca6a > source.txt  

After the failed cluster node comes back online, the remaining node cannot start, because it cannot resolve its own IP: its own entry was removed from DNS as well. The NXDOMAIN DNS response is cached by the Kubernetes cluster node OS again and again, so the node cannot start at all anymore.
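
To illustrate the failure mode, here is a minimal Python sketch of an IPv4 lookup that treats a failed resolution as transient and retries, instead of aborting on the first error. It is not how Manticore resolves nodes; `resolve_ipv4` and its `attempts`, `delay` and `resolver` parameters are hypothetical names used only for illustration. Retrying does not help while a negative (NXDOMAIN) answer is still cached upstream, so this only shows the "do not treat it as fatal" part.

```python
import socket
import time


def resolve_ipv4(host, attempts=5, delay=1.0, resolver=socket.getaddrinfo):
    """Return the sorted IPv4 addresses for host, or [] if it never resolves.

    `resolver` is injectable so the retry logic can be exercised without DNS.
    """
    for attempt in range(attempts):
        try:
            infos = resolver(host, None, socket.AF_INET, socket.SOCK_STREAM)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # for AF_INET the sockaddr is an (ip, port) tuple.
            return sorted({info[4][0] for info in infos})
        except socket.gaierror:
            # The name is missing right now (e.g. the pod is restarting).
            # Wait and try again unless this was the last attempt.
            if attempt + 1 < attempts:
                time.sleep(delay)
    return []
```

A caller could keep the peer marked as unavailable while the list is empty, for example `resolve_ipv4("backend-manticoresearch-worker-1.backend-manticoresearch-worker")`, and try again later.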

@zerthimon
Author

While the issue referenced above is resolved, for Kubernetes-based installations Manticore should not rely on DNS, as it is unreliable overall (entries are removed and added dynamically, and DNS responses are cached). Instead, Manticore should query the kube-api directly for the IPs of its pods.
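
As a rough sketch of what querying the kube-api could look like (in Python, not Manticore code), the snippet below reads the Endpoints object of the headless service and maps pod hostnames to IPs. Assumptions: the pod's service account is allowed to `get` endpoints (RBAC), the token and CA are mounted at the default service account path, and the API is reachable at `https://kubernetes.default.svc`. The function names `endpoint_ips` and `fetch_endpoints` are hypothetical, and the namespace has to be supplied by the caller.

```python
import json
import ssl
import urllib.request

# Default in-cluster mount point for service account credentials.
SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"


def endpoint_ips(endpoints, include_not_ready=True):
    """Map pod hostname (or the IP if no hostname is set) to pod IP.

    `endpoints` is a parsed core/v1 Endpoints object.
    """
    keys = ["addresses"]
    if include_not_ready:
        # Pods that exist but are not Ready yet are listed separately.
        keys.append("notReadyAddresses")
    ips = {}
    for subset in endpoints.get("subsets") or []:
        for key in keys:
            for addr in subset.get(key) or []:
                ips[addr.get("hostname") or addr["ip"]] = addr["ip"]
    return ips


def fetch_endpoints(namespace, service, api="https://kubernetes.default.svc"):
    """Fetch the Endpoints object for a service from inside a pod."""
    with open(f"{SA_DIR}/token") as f:
        token = f.read().strip()
    ctx = ssl.create_default_context(cafile=f"{SA_DIR}/ca.crt")
    req = urllib.request.Request(
        f"{api}/api/v1/namespaces/{namespace}/endpoints/{service}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, context=ctx, timeout=5) as resp:
        return json.load(resp)
```

For the setup in the log above this would be something like `endpoint_ips(fetch_endpoints("default", "backend-manticoresearch-worker"))`, where `"default"` is only a placeholder namespace. This path does not go through CoreDNS or the node's resolver cache for pod names, although the API server address itself still has to be reachable.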

@sanikolaev
Collaborator

Can you please elaborate on

Manticore should not rely on DNS as it's unreliable overall (with entries removed and added dynamically and the DNS caching)

and provide an example?

sanikolaev reopened this Oct 21, 2021
sanikolaev added the "waiting" label (Waiting for the original poster (in most cases) or something else) Oct 21, 2021