
intermittent core dump when running query with SNIPPET() function #387

Closed
popalot2 opened this issue Aug 12, 2020 · 22 comments

@popalot2

popalot2 commented Aug 12, 2020

Describe the bug
Manticore dev 3.5.1 intermittently produces a core dump when running a query (roughly 5% of the time the query hits a core dump; 95% of the time it works fine).

To Reproduce
Unclear yet.

Expected behavior
Expected not to produce a core dump.

Describe the environment:

  • Manticore Search version: Manticore 3.5.1 04af034@200811 release
  • OS version: CentOS 7 x64, Linux main 5.6.14-1.el7.elrepo.x86_64 #1 SMP Tue May 19 12:17:13 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

Messages from log files:
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/searchd --coredump --config /etc/manticoresearch/manticore.conf'.
Program terminated with signal 11, Segmentation fault.
#0 SnippetBuilder_c::Impl_c::Setup (this=0x7efc1c010540, pIndex=0x0, tSettings=...) at /usr/src/debug/manticore-3.5.1-200811-04af034-release-rhel7/bin/src_0/src/sphinxexcerpt.cpp:1369
1369 m_pDict = GetStatelessDict ( pIndex->GetDictionary () );
(gdb) p
The history is empty.
(gdb) where
#0 SnippetBuilder_c::Impl_c::Setup (this=0x7efc1c010540, pIndex=0x0, tSettings=...) at /usr/src/debug/manticore-3.5.1-200811-04af034-release-rhel7/bin/src_0/src/sphinxexcerpt.cpp:1369
#1 0x00000000006fe6d8 in SnippetBuilder_c::Setup (this=, pIndex=, tQuery=...) at /usr/src/debug/manticore-3.5.1-200811-04af034-release-rhel7/bin/src_0/src/sphinxexcerpt.cpp:1590
#2 0x000000000092ffa4 in Expr_Snippet_c::Expr_Snippet_c (this=0x7efc1c00fc20, pArglist=0x7efc1c00fbf0, pIndex=, pProfiler=, eQueryType=, sError=...)
at /usr/src/debug/manticore-3.5.1-200811-04af034-release-rhel7/bin/src_0/src/searchdexpr.cpp:365
#3 0x0000000000932a07 in ExprHook_c::CreateNode (this=0x7ef6ec0413e0, iID=, pLeft=0x7efc1c00fbf0, pEvalStage=, sError=...) at /usr/src/debug/manticore-3.5.1-200811-04af034-release-rhel7/bin/src_0/src/searchdexpr.cpp:727
#4 0x000000000082b57b in ExprParser_t::CreateTree (this=this@entry=0x7efc500fe340, iNode=13) at /usr/src/debug/manticore-3.5.1-200811-04af034-release-rhel7/bin/src_0/src/sphinxexpr.cpp:6206
#5 0x000000000082eb3c in ExprParser_t::Parse (this=this@entry=0x7efc500fe340, sExpr=sExpr@entry=0x7efc50017ce0 "snippet(subtitles, QUERY(),'before_match=<<<','after_match=>>>','limit=300','around=30','limit_passages=1')", tSchema=..., pAttrType=0x7efc500fe4f8,
pUsesWeight=0x7efc500fe542, sError=...) at /usr/src/debug/manticore-3.5.1-200811-04af034-release-rhel7/bin/src_0/src/sphinxexpr.cpp:9309
#6 0x000000000082ee86 in sphExprParse (sExpr=0x7efc50017ce0 "snippet(subtitles, QUERY(),'before_match=<<<','after_match=>>>','limit=300','around=30','limit_passages=1')", tSchema=..., sError=..., tArgs=...)
at /usr/src/debug/manticore-3.5.1-200811-04af034-release-rhel7/bin/src_0/src/sphinxexpr.cpp:9347
#7 0x000000000073f1b0 in QueueCreator_c::ParseQueryItem (this=this@entry=0x7efc1c4afb20, tItem=...) at /usr/src/debug/manticore-3.5.1-200811-04af034-release-rhel7/bin/src_0/src/sphinxsort.cpp:6239
#8 0x000000000073f832 in operator() (v=..., __closure=) at /usr/src/debug/manticore-3.5.1-200811-04af034-release-rhel7/bin/src_0/src/sphinxsort.cpp:6430
#9 TestAll<QueueCreator_c::MaybeAddExpressionsFromSelectList()::__lambda39> (cond=, this=0x7efc500150b8) at /usr/src/debug/manticore-3.5.1-200811-04af034-release-rhel7/bin/src_0/src/sphinxstd.h:1253
#10 QueueCreator_c::MaybeAddExpressionsFromSelectList (this=0x7efc1c4afb20) at /usr/src/debug/manticore-3.5.1-200811-04af034-release-rhel7/bin/src_0/src/sphinxsort.cpp:6430
#11 0x0000000000741ca8 in CreateMultiQueue (dCreators=..., tQueue=..., dQueries=..., dSorters=..., dErrors=..., tRes=..., dExtras=...) at /usr/src/debug/manticore-3.5.1-200811-04af034-release-rhel7/bin/src_0/src/sphinxsort.cpp:7036
#12 0x000000000074220c in sphCreateMultiQueue (tQueue=..., dQueries=..., dSorters=..., dErrors=..., tRes=..., dExtras=...) at /usr/src/debug/manticore-3.5.1-200811-04af034-release-rhel7/bin/src_0/src/sphinxsort.cpp:7152
#13 0x000000000059d53b in SearchHandler_c::CreateMultiQueryOrFacetSorters (this=0x7ef6ec0402b0, pIndex=, dSorters=..., dErrors=..., dExtraSchemas=..., tQueueRes=...) at /usr/src/debug/manticore-3.5.1-200811-04af034-release-rhel7/bin/src_0/src/searchd.cpp:5214
#14 0x00000000005b4398 in std::_Function_handler<void(), SearchHandler_c::RunLocalSearchesCoro()::__lambda18>::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/src/debug/manticore-3.5.1-200811-04af034-release-rhel7/bin/src_0/src/searchd.cpp:5569
#15 0x0000000000969947 in operator() (__closure=0x7efc500de900) at /usr/src/debug/manticore-3.5.1-200811-04af034-release-rhel7/bin/src_0/src/task_info.cpp:216
#16 std::_Function_handler<void(), myinfo::OwnMini(Threads::Handler)::__lambda14>::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/4.8.2/functional:2071
#17 0x0000000000965bf7 in WorkerLowest (tCtx=, this=0x7efc500de948) at /usr/src/debug/manticore-3.5.1-200811-04af034-release-rhel7/bin/src_0/src/coroutine.cpp:74
#18 operator() (pT=..., __closure=0x0) at /usr/src/debug/manticore-3.5.1-200811-04af034-release-rhel7/bin/src_0/src/coroutine.cpp:95
#19 Threads::CoRoutine_c::CoRoutine_c(std::function<void ()>, unsigned long)::{lambda(boost::context::detail::transfer_t)#1}::_FUN(boost::context::detail::transfer_t) (pT=...) at /usr/src/debug/manticore-3.5.1-200811-04af034-release-rhel7/bin/src_0/src/coroutine.cpp:96
#20 0x0000000000965bf7 in CoroWorker_c (pScheduler=, fnHandler=..., this=)
#21 StartPrimary (pScheduler=, fnHandler=...) at /usr/src/debug/manticore-3.5.1-200811-04af034-release-rhel7/bin/src_0/src/coroutine.cpp:301
#22 Threads::CoRoutine_c::CoRoutine_c(std::function<void ()>, unsigned long)::{lambda(boost::context::detail::transfer_t)#1}::_FUN(boost::context::detail::transfer_t) () at /usr/src/debug/manticore-3.5.1-200811-04af034-release-rhel7/bin/src_0/src/coroutine.cpp:434
#23 0x000000000096b2ef in make_fcontext ()
#24 0x0000000000000000 in ?? ()

Additional context
query:
SELECT id,vid videoid,allowembed,SNIPPET(subtitles, QUERY(),'before_match=<<<','after_match=>>>','limit=300','around=30','limit_passages=1') snip,title,viewcount,likecount,dislikecount,subnum,subisauto,sublang FROM idx_subtitles_content_dev WHERE MATCH('@subtitles "hello world"') AND subisauto=1 LIMIT 0,50 OPTION max_query_time=3000;

Currently trying to rebuild the index to see if the problem resolves; if more data is needed, please suggest what to provide.

@popalot2 changed the title from "intermittent core dump when running SQL query" to "intermittent core dump when running query with SNIPPET() function" Aug 12, 2020
@popalot2
Author

Downgraded to Manticore 3.5.0 1d34c49@200722; the issue no longer seems to occur, so the bug was introduced sometime since then. Will continue monitoring. It looks like the pIndex variable passed into the method is null, but I don't have sufficient familiarity with the code base to debug this further.
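
For illustration, a minimal self-contained sketch of that crash site, using stand-in types (Index, Dict and GetStatelessDict below are not the real Manticore declarations); it only mirrors the single line quoted in frame #0 and adds the null guard that the crash implies is missing:

// Minimal, self-contained sketch (NOT the real Manticore classes) of the crash
// site shown in frame #0 of the backtrace: Setup() dereferences pIndex, which
// arrives as nullptr, so pIndex->GetDictionary() segfaults.
#include <cstdio>
#include <string>

struct Dict {};                                                        // stand-in type
struct Index { Dict * GetDictionary () const { return nullptr; } };   // stand-in type

static Dict * GetStatelessDict ( Dict * pDict ) { return pDict; }     // stand-in

struct SnippetBuilderSketch
{
	Dict * m_pDict = nullptr;

	// Hypothetical guarded variant; the 3.5.1 code calls
	// pIndex->GetDictionary() unconditionally, as on line 1369 above.
	bool Setup ( const Index * pIndex, std::string & sError )
	{
		if ( !pIndex )
		{
			sError = "snippet: no index attached to the expression hook";
			return false;
		}
		m_pDict = GetStatelessDict ( pIndex->GetDictionary () );
		return true;
	}
};

int main ()
{
	SnippetBuilderSketch tBuilder;
	std::string sError;
	if ( !tBuilder.Setup ( nullptr, sError ) )   // a null index, as seen in the core dump
		std::printf ( "setup failed: %s\n", sError.c_str() );
	return 0;
}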

@tomatolog
Contributor

Could you provide your config and idx_subtitles_content_dev so we can debug the crash locally?

Is the index plain or RT? Do you reindex your data and rotate that index?

@popalot2
Author

popalot2 commented Aug 12, 2020

Could you provide your config and idx_subtitles_content_dev so we can debug the crash locally?

Is the index plain or RT? Do you reindex your data and rotate that index?

Do you mean the entire index? It's around 20 GB; I don't think that's feasible.
Config file attached (it's generated by a PHP script). It's a regular (plain) index, and the error did not happen during reindexing. I've tried reindexing, but the errors remained until I downgraded to 3.5.0. I can provide the core file if you have a more private location to upload it to.

@tomatolog
Contributor

You can upload all your private data to our write-only FTP:

ftp: dev.manticoresearch.com
user: manticorebugs
pass: shithappens

However, a core file can only be examined on the box where the crash happened, so it will probably be useless on our box.

Could you upload your daemon log, where more events are logged before and after the crash?

@popalot2
Author

Nice password. Uploaded to /github-issue-387. There doesn't seem to be much of interest in the daemon log.

@popalot2
Author

popalot2 commented Aug 12, 2020

I will upload the idx_subtitles_content_dev_2 shard shortly; it will be around 2 GB compressed. Hopefully it will be sufficient to reproduce the problem.

@popalot2
Author

Upload completed to /github-issue-387.

@popalot2
Author

popalot2 commented Jan 1, 2021

Hello, this issue still happens for me in 3.5.4 13f8d08@201211, intermittently as before.
Do you have an update on this? Were you able to reproduce the bug?

@sanikolaev
Collaborator

@popalot2 Hello. I can't find your config either in the FTP folder or here. Can you upload it to the same folder or attach it to this issue?

popalot2 added a commit to popalot2/manticoresearch that referenced this issue Jan 4, 2021
test fix for intermittent core dump when running query with SNIPPET() function manticoresoftware#387
Due to the mutable ExprHook_c m_tHook being shared across all threads, there is a race condition when a parallel local query uses SNIPPET(): multiple threads call tCtx.m_tHook.SetIndex ( pServed->m_pIndex ) during parsing, which causes a core dump.
Added a lock to prevent the race condition. This should probably use a separate hook per thread instead, but I am not familiar enough with the code base to make such a fix.
Tested extensively and the problem no longer occurs. The performance impact seems negligible, though it might be noticeable for workloads with many small queries.
@popalot2
Author

popalot2 commented Jan 4, 2021

Hello @sanikolaev, it seems your FTP has run out of space; I can't upload.

Command: MKD /github-issue-387-update2
Response: 550 /github-issue-387-update2: No space left on device

I've diagnosed the issue and made a test fix here:
popalot2@83afaa5#diff-5b160f2c31d829f0bcc9599cf593bab4c8d3c467a9d573cf008ba240633b5fae

Due to the mutable ExprHook_c m_tHook being shared across all threads, there is a race condition when a parallel local query uses SNIPPET(): multiple threads call tCtx.m_tHook.SetIndex ( pServed->m_pIndex ) during parsing, which causes a core dump.
Added a lock to prevent the race condition. This should probably use a separate hook per thread instead, but I am not familiar enough with the code base to make such a fix.
Tested extensively and the problem no longer occurs. The performance impact seems negligible, though it might be noticeable for workloads with many small queries.

You have this comment in your code, but it seems you never got around to fixing it:

// FIXME!!! breaks for dist threads with SNIPPETS expressions for queries to multiple indexes
mutable ExprHook_c				m_tHook;
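
A simplified sketch of the race and the lock-based workaround; ExprHookSketch, QueryContextSketch and ParseLocalQuery are stand-in names for illustration, not the actual Manticore code:

// Simplified, self-contained sketch of the race described above: one hook
// object is shared by the query context, several worker threads each call
// SetIndex() on it while parsing their local-index query, and a worker can
// observe another thread's index pointer instead of its own.
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

struct Index {};

struct ExprHookSketch
{
	const Index * m_pIndex = nullptr;
	void SetIndex ( const Index * pIndex ) { m_pIndex = pIndex; }
};

struct QueryContextSketch
{
	mutable ExprHookSketch m_tHook;        // shared across workers, like m_tHook above
};

// The PR-style workaround: one global lock serializing the parse step so that
// SetIndex() and the later use of m_pIndex cannot interleave across threads.
static std::mutex g_tParseLock;

static void ParseLocalQuery ( QueryContextSketch & tCtx, const Index * pServedIndex )
{
	std::lock_guard<std::mutex> tLock ( g_tParseLock );   // the "test fix"
	tCtx.m_tHook.SetIndex ( pServedIndex );
	// ... expression parsing would run here; without the lock another worker
	// may overwrite m_pIndex before this one reads it ...
	assert ( tCtx.m_tHook.m_pIndex==pServedIndex );
}

int main ()
{
	QueryContextSketch tCtx;
	Index dLocals[4];
	std::vector<std::thread> dWorkers;
	for ( auto & tIndex : dLocals )                        // parallel local searches
		dWorkers.emplace_back ( ParseLocalQuery, std::ref(tCtx), &tIndex );
	for ( auto & tThread : dWorkers )
		tThread.join();
	return 0;
}

The lock removes the race at the cost of serializing the parse step for every search, which is the trade-off discussed further down in this thread.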

@sanikolaev
Collaborator

Response: 550 /github-issue-387-update2: No space left on device

Sorry about that, please try now.

I've diagnosed the issue and made a test fix here:

Awesome! Can you make a PR?

@popalot2 mentioned this issue Jan 4, 2021
@popalot2
Author

popalot2 commented Jan 4, 2021

I think I made a PR; I'm not experienced with GitHub and not really sure how that works.

Uploaded the config, index and query to github-issue-387/update_04_01_2021/popalot2bug387update01Jan2021..zip.
If you want to reproduce it, run the query repeatedly in a script; sometimes it takes a while to hit the crash.
This is just a test fix; I think it should be fixed properly by eliminating the contention between threads during parsing.

@sanikolaev
Collaborator

I think I made a PR

Right. Here it is #472

I've reproduced the crash on your data/query. Our developers will look into that and into the PR.

@githubmanticore
Contributor

githubmanticore commented Jan 5, 2021

➤ Sergey Nikolaev commented:

How to reproduce on the dev server:

  1. Start searchd:
root@dev /home/manticorebugs/ftproot/github-issue-387/update_04_01_2021/upload/conf # searchd -c manticore.conf 
Manticore 3.5.4 13f8d08@201211 release 
Copyright (c) 2001-2016, Andrew Aksyonoff 
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com) 
Copyright (c) 2017-2020, Manticore Software LTD (http://manticoresearch.com) 
 
[58:21.050] [8669] using config file 'manticore.conf' (86208 chars)... 
starting daemon version '3.5.4 13f8d08@201211 release' ... 
listening on 127.0.0.1:9315 for mysql 
precaching index 'idx_all_base' 
WARNING: index 'idx_all_base': prealloc: failed to open ../data/idx_all_base.sph: No such file or directory - NOT SERVING 
precaching index 'idx_manual_subtitles_content_prod_dump_0' 
precaching index 'idx_manual_subtitles_content_prod_dump_1' 
precaching index 'idx_manual_subtitles_content_prod_dump_2' 
precaching index 'idx_manual_subtitles_content_prod_dump_3' 
precaching index 'idx_manual_subtitles_content_prod_dump_4' 
precaching index 'idx_manual_subtitles_content_prod_dump_5' 
precaching index 'idx_manual_subtitles_content_prod_dump_6' 
precaching index 'idx_manual_subtitles_content_prod_dump_7' 
precaching index 'idx_manual_subtitles_content_prod_dump_8' 
precaching index 'idx_manual_subtitles_content_prod_dump_9' 
precaching index 'idx_manual_subtitles_content_prod_dump_10' 
precaching index 'idx_manual_subtitles_content_prod_dump_11' 
precached 12 indexes in 0.075 sec 
  2. Run:
root@dev /home/manticorebugs/ftproot/github-issue-387/update_04_01_2021/upload/conf # while true; do mysql -P9315 -h0 -e "SELECT SNIPPET(subtitles, 'jopa') snip FROM idx_manual_subtitles_content_prod_dump_2" > /dev/null; done; 
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 104 
ERROR 2003 (HY000): Can't connect to MySQL server on '0' (111) 
ERROR 2003 (HY000): Can't connect to MySQL server on '0' (111) 
ERROR 2003 (HY000): Can't connect to MySQL server on '0' (111) 
ERROR 2003 (HY000): Can't connect to MySQL server on '0' (111) 
ERROR 2003 (HY000): Can't connect to MySQL server on '0' (111) 
ERROR 2003 (HY000): Can't connect to MySQL server on '0' (111) 
ERROR 2003 (HY000): Can't connect to MySQL server on '0' (111) 
ERROR 2003 (HY000): Can't connect to MySQL server on '0' (111) 
ERROR 2003 (HY000): Can't connect to MySQL server on '0' (111) 
ERROR 2003 (HY000): Can't connect to MySQL server on '0' (111) 
  3. Find the crashlog in the searchd log:
root@dev /home/manticorebugs/ftproot/github-issue-387/update_04_01_2021/upload/conf # less ../log/searchd.log 

Notes:

  • Crashes in branch 'docnum' as well
  • The index passes indextool --check validation:
root@dev /home/manticorebugs/ftproot/github-issue-387/update_04_01_2021/upload/conf # indextool -c manticore.conf --check idx_manual_subtitles_content_prod_dump_2 
Manticore 3.5.4 13f8d08@201211 release 
Copyright (c) 2001-2016, Andrew Aksyonoff 
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com) 
Copyright (c) 2017-2020, Manticore Software LTD (http://manticoresearch.com) 
 
using config file 'manticore.conf'... 
checking index 'idx_manual_subtitles_content_prod_dump_2'... 
checking schema... 
checking dictionary... 
checking data... 
checking rows... 
checking attribute blocks index... 
checking kill-list... 
checking docstore... 
checking dead row map... 
checking doc-id lookup... 
check passed, 0.2 sec elapsed 

@githubmanticore
Contributor

➤ Aleksey N. Vinogradov commented:

At first glance, concurrent searches (and the comment about them) should not be relevant: the way it works now, each search gets an isolated clone of the context that belongs to one thread only, so it should not be affected by any concurrency.

The fact that serializing all searches helps may simply be an effect of the serialization itself (one static mutex forces all searches on all threads through a single serialization point), so we need to look deeper into how the index pointer can become zeroed (the moment of cloning may be the key).

Anyway, the random mutex here is a good hint.

@popalot2
Author

➤ Aleksey N. Vinogradov commented:

At first glance, concurrent searches (and the comment about them) should not be relevant: the way it works now, each search gets an isolated clone of the context that belongs to one thread only, so it should not be affected by any concurrency.

The fact that serializing all searches helps may simply be an effect of the serialization itself (one static mutex forces all searches on all threads through a single serialization point), so we need to look deeper into how the index pointer can become zeroed (the moment of cloning may be the key).

Anyway, the random mutex here is a good hint.

I agree that this needs a deeper dig.
I haven't serialized all searches; I only serialized the parsing, and for my use case there was no significant performance impact. Certainly better than having the server core dump on 5-10% of queries.

@tomatolog
Contributor

tomatolog commented Jan 11, 2021

I'm against applying PR #472 as is, since it's clear that dCtx.CloneNewContext(); should clone m_tHook too instead of blocking parallel execution.

Another option is to limit concurrency to 1 when m_tHook is present, if the code to clone it cannot easily be written.
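
A minimal sketch of what cloning the hook per context could look like; the types and the CloneNewContext signature below are stand-ins for illustration, not the real implementation:

// Self-contained sketch of the alternative: instead of sharing one hook and
// locking around it, each cloned per-worker context receives its own copy of
// the hook, already bound to the local index that worker will parse against,
// so no lock is needed at parse time.
struct Index {};

struct ExprHookSketch
{
	const Index * m_pIndex = nullptr;
	void SetIndex ( const Index * pIndex ) { m_pIndex = pIndex; }
};

struct SharedContextSketch
{
	ExprHookSketch m_tHook;                        // template the clones copy from

	// Hypothetical CloneNewContext(): returns a clone with its own hook.
	SharedContextSketch CloneNewContext ( const Index * pLocalIndex ) const
	{
		SharedContextSketch tClone = *this;        // copies m_tHook as well
		tClone.m_tHook.SetIndex ( pLocalIndex );   // per-clone index, no races
		return tClone;
	}
};

int main ()
{
	SharedContextSketch tParent;
	Index tLocal;
	SharedContextSketch tClone = tParent.CloneNewContext ( &tLocal );
	return tClone.m_tHook.m_pIndex==&tLocal ? 0 : 1;
}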

@popalot2
Author

I'm against applying PR #472 as is, since it's clear that dCtx.CloneNewContext(); should clone m_tHook too instead of blocking parallel execution.

Another option is to limit concurrency to 1 when m_tHook is present, if the code to clone it cannot easily be written.

I concur that this fix should not be merged into the production branch.
When do you expect to have a proper fix?
This bug has been open for almost 5 months.

@tomatolog
Contributor

we will fix the issue in the upcoming release

@githubmanticore
Contributor

➤ Aleksey N. Vinogradov commented:

I haven't serialized all searches

The patch you provided does exactly that :)

One static global variable (the lock) affects every search regardless of which indexes are in use, simply because it is global. And it is taken on every turn before creating the sorters, so every search goes through the lock whether it uses snippets or not.
Anyway, that is a matter for another investigation; apart from the comment you noticed, there is another one in the hook marked 'BLOODY HACK' :)

@githubmanticore
Contributor

➤ Aleksey N. Vinogradov commented:

Strictly speaking, running the provided line ... snip FROM idx_manual_subtitles_content_prod_dump_2 should not reproduce the issue, since it is a single (not distributed) index. And indeed it doesn't reproduce it, as expected.

However, using the truly distributed index idx_manual_subtitles_content_prod_dump does.

@klirichek
Contributor

I've pushed the fix: e8b3b38

If you're interested, the reason was misuse of that hook. That is explainable by the fact that snippets over distributed indexes weren't a use case at all before, so the hook and its misuse were just a kind of 'theoretical future issue'.
