searchd hang with 100% times number of cpu cores #436
It looks like you've backtraced the wrong process, or the reason is somewhere else in the system. All the daemon does on startup (in daemon mode) is a double-fork, to detach from the controlling terminal and session leadership. That is a simple fork() + retcode check + exit on the uninteresting branch, and you've traced the dying process at that stage.
This is the second process's backtrace:
#0 0x00007f140d9a79a3 in select () from /lib64/libc.so.6
This backtrace looks like it is from the real daemon process, but it ends up at
It would be more relevant if you spot the hung process/thread with the 'top' command and then perform 'gdb attach' on that exact PID.
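For example, a minimal sketch of that workflow (the PID below is just a placeholder; replace it with the searchd process you actually see spinning in top):

# placeholder: the searchd PID reported by top/ps
PID=12345

# list per-thread CPU usage for the suspect PID; note the LWP of any thread stuck near 100%
top -H -p "$PID"

# attach gdb to that same PID and dump every thread's stack in one shot
gdb -p "$PID" -batch -ex 'info threads' -ex 'thread apply all bt'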
These are the full backtraces:
(gdb) info threads
(gdb) thread apply all bt
Thread 20 (Thread 0x7f5300540700 (LWP 13913)):
Thread 19 (Thread 0x7f530051f700 (LWP 13914)):
Thread 18 (Thread 0x7f53004fe700 (LWP 13915)):
Thread 17 (Thread 0x7f53004dd700 (LWP 13916)):
Thread 16 (Thread 0x7f53004bc700 (LWP 13917)):
Thread 15 (Thread 0x7f530049b700 (LWP 13918)):
Thread 14 (Thread 0x7f530047a700 (LWP 13919)):
Thread 13 (Thread 0x7f5300459700 (LWP 13920)):
Thread 12 (Thread 0x7f5300438700 (LWP 13921)):
Thread 11 (Thread 0x7f5300417700 (LWP 13922)):
Thread 10 (Thread 0x7f53003f6700 (LWP 13923)):
Thread 9 (Thread 0x7f53003d5700 (LWP 13924)):
Thread 8 (Thread 0x7f53003b4700 (LWP 13925)):
Thread 7 (Thread 0x7f5300393700 (LWP 13926)):
Thread 6 (Thread 0x7f52fd419700 (LWP 13927)):
Thread 5 (Thread 0x7f52fd3f8700 (LWP 13928)):
Thread 4 (Thread 0x7f52fd1c4700 (LWP 13929)):
Thread 3 (Thread 0x7f529c87b700 (LWP 13931)):
Thread 2 (Thread 0x7f529c85a700 (LWP 13932)):
Thread 1 (Thread 0x7f5300542900 (LWP 13912)):
(gdb) bt
(gdb) info locals
These are the last log messages while the server was hung:
[root@p1 ~]# tail -n50 /var/log/manticore/searchd.log
Every thread in your trace sleeps in a 'wait' or 'lock' function; in normal mode none of them would occupy any quantum of CPU at all. If you suspect searchd, catch the exact PID with the 'top' command and make sure it is really searchd that hangs.
It was the right process id. Today it happened again; I think it is some kind of exceeding the maximum allowed events. This is the strace and backtrace:
[root@p1 ~]# strace -p 9325 -f
[root@p1 ~]# gdb attach 9325
(gdb) thread apply all bt
Thread 23 (Thread 0x7f03b25a1700 (LWP 9326)):
Thread 22 (Thread 0x7f03b2580700 (LWP 9327)):
Thread 21 (Thread 0x7f03b255f700 (LWP 9328)):
Thread 20 (Thread 0x7f03b253e700 (LWP 9329)):
Thread 19 (Thread 0x7f03b251d700 (LWP 9330)):
Thread 18 (Thread 0x7f03b24fc700 (LWP 9331)):
Thread 17 (Thread 0x7f03b24db700 (LWP 9332)):
Thread 16 (Thread 0x7f03b24ba700 (LWP 9333)):
Thread 15 (Thread 0x7f03b2499700 (LWP 9334)):
Thread 14 (Thread 0x7f03b2478700 (LWP 9335)):
Thread 13 (Thread 0x7f03b2457700 (LWP 9336)):
Thread 12 (Thread 0x7f03b2436700 (LWP 9337)):
Thread 11 (Thread 0x7f03b2415700 (LWP 9338)):
Thread 10 (Thread 0x7f03b23f4700 (LWP 9339)):
Thread 9 (Thread 0x7f03af479700 (LWP 9340)):
Thread 8 (Thread 0x7f03af458700 (LWP 9341)):
Thread 7 (Thread 0x7f03af224700 (LWP 9342)):
Thread 6 (Thread 0x7f03af15c700 (LWP 9343)):
Thread 5 (Thread 0x7f034e62d700 (LWP 9344)):
Thread 4 (Thread 0x7f034e60c700 (LWP 9345)):
Thread 3 (Thread 0x7f034e177700 (LWP 29871)):
Thread 2 (Thread 0x7f034fdf6700 (LWP 29873)):
Thread 1 (Thread 0x7f03b25a3900 (LWP 9325)):
We've looked into the dump with the developers and it's still not clear what the likely reason could be. We also haven't heard from anyone else recently about a similar issue. Can you provide your index and a part of the query log that we could run in a loop on our side to reproduce the issue? If you can, please use our write-only FTP: https://mnt.cr/shithappens
For many days I tried to catch how it happens, with no luck. Then I remembered that starting with version 3.5 I moved to the distributed rpm package, so I decided to compile the source code on my three servers (CentOS 7.8, 7.9, and 8.2):
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DDISTR_BUILD=centos8 -DUSE_SSL=0 -DWITH_STEMMER=1 -DUSE_JEMALLOC=1 -DUSE_BISON=1 -DUSE_FLEX=1 -DUSE_SYSLOG=0 -DWITH_ICONV=1 -DWITH_RE2=1 -DWITH_ZLIB=1 -DWITH_MYSQL=1 -DWITH_ODBC=0 -DBoost_DEBUG=1 -DWITH_PGSQL=0 -DUSE_GALERA=0 ..
It looks like the problem was produced by the installed rpm package; here is the status after 102 hours. Maybe this information will be helpful for such a case in the future.
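For reference, a minimal sketch of the build sequence around that cmake line, assuming the upstream manticoresearch git repository and a standard out-of-source CMake build (the clone URL and job count are assumptions; the cmake flags are the ones quoted above):

# assumption: building from the upstream repository; adjust URL/branch to your setup
git clone https://github.com/manticoresoftware/manticoresearch.git
cd manticoresearch
mkdir build && cd build
# same flags as quoted above; RelWithDebInfo keeps symbols so gdb backtraces stay readable
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DDISTR_BUILD=centos8 -DUSE_SSL=0 \
      -DWITH_STEMMER=1 -DUSE_JEMALLOC=1 -DUSE_BISON=1 -DUSE_FLEX=1 -DUSE_SYSLOG=0 \
      -DWITH_ICONV=1 -DWITH_RE2=1 -DWITH_ZLIB=1 -DWITH_MYSQL=1 -DWITH_ODBC=0 \
      -DBoost_DEBUG=1 -DWITH_PGSQL=0 -DUSE_GALERA=0 ..
make -j"$(nproc)"
sudo make install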
Manticore 3.5 and above hangs with 100% CPU usage once or twice daily; I have tried many times to reproduce the case with no luck.
I used gdb to get this backtrace when it was hung.
(gdb) bt
#0 0x00007f140c4d55b0 in _fini () from /lib64/libpcre.so.1
#1 0x00007f140f3be098 in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#2 0x00007f140d8ebce9 in __run_exit_handlers () from /lib64/libc.so.6
#3 0x00007f140d8ebd37 in exit () from /lib64/libc.so.6
#4 0x00000000005a9427 in SetWatchDog (iDevNull=3) at /usr/src/debug/manticore-3.5.2-201030-b55cd3c-release-rhel7/applications/src_0/src/searchd.cpp:17449
#5 0x00000000005c466f in ServiceMain (argc=3, argv=<optimized out>) at /usr/src/debug/manticore-3.5.2-201030-b55cd3c-release-rhel7/applications/src_0/src/searchd.cpp:18597
#6 0x000000000059320e in main (argc=3, argv=0x7ffeaaf8e0f8) at /usr/src/debug/manticore-3.5.2-201030-b55cd3c-release-rhel7/applications/src_0/src/searchd.cpp:19030
hope it helps to solve this bug.
Linux p1.mourjan.com 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
The index was checked with no problems.
These are the last messages from the log before the hang:
[Tue Nov 3 12:22:06.373 2020] [32607] caught SIGHUP (seamless=1, in_rotate=0, need_rotate=0)
[Tue Nov 3 12:22:06.373 2020] [32619] rotating index 'locality_counts': started
[Tue Nov 3 12:22:06.374 2020] [32619] rotating index 'locality_counts': success
[Tue Nov 3 12:22:06.375 2020] [32619] rotating index 'section_tag_counts': started
[Tue Nov 3 12:22:06.376 2020] [32619] rotating index 'section_tag_counts': success
[Tue Nov 3 12:22:06.377 2020] [32619] rotating index 'section_counts': started
[Tue Nov 3 12:22:06.378 2020] [32619] rotating index 'section_counts': success
[Tue Nov 3 12:22:06.379 2020] [32619] rotating index 'adx': started
[Tue Nov 3 12:22:06.380 2020] [32619] rotating index 'adx': success
[Tue Nov 3 12:22:06.381 2020] [32619] rotating index: all indexes done
[Tue Nov 3 12:25:33.904 2020] [32607] caught SIGHUP (seamless=1, in_rotate=0, need_rotate=0)
[Tue Nov 3 12:25:33.905 2020] [32621] rotating index 'locality_counts': started
[Tue Nov 3 12:25:33.906 2020] [32621] rotating index 'locality_counts': success
[Tue Nov 3 12:25:33.906 2020] [32621] rotating index 'section_tag_counts': started
[Tue Nov 3 12:25:33.907 2020] [32621] rotating index 'section_tag_counts': success
[Tue Nov 3 12:25:33.908 2020] [32621] rotating index 'section_counts': started
[Tue Nov 3 12:25:33.909 2020] [32621] rotating index 'section_counts': success
[Tue Nov 3 12:25:33.910 2020] [32621] rotating index 'adx': started
[Tue Nov 3 12:25:33.910 2020] [32621] rotating index 'adx': success
[Tue Nov 3 12:25:33.911 2020] [32621] rotating index: all indexes done
Regards