
Group By on JSON Data Causes Primary To Crash #1265

flakone2010 opened this issue Jul 20, 2023 · 8 comments
Labels
waiting Waiting for the original poster (in most cases) or something else



flakone2010 commented Jul 20, 2023

When performing a GROUP BY query against JSON-structured data (which is treated like a facet) across 4 distributed nodes, the primary server making the call crashes while grouping the results from the nodes. The individual nodes successfully return their data to the primary server. The last query in the trace attached below is the one that causes the Manticore instance to crash.

The primary server does not host any local indexes.

Environment:
Manticore 6.0.4 1a3a4ea@230314
Ubuntu 20.04
Clients connect via the MySQL protocol on port 9312
Agents listen on port 9315
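
The setup roughly corresponds to a distributed table on the primary that only points at the four agents. A simplified sketch of the topology (hostnames are placeholders; this is not our actual configuration):

```
# primary: no local tables, only a distributed one fanning out to the 4 nodes
index RewardTranslationIndexEn
{
    type  = distributed
    agent = node1:9315:RewardTranslationIndexEn
    agent = node2:9315:RewardTranslationIndexEn
    agent = node3:9315:RewardTranslationIndexEn
    agent = node4:9315:RewardTranslationIndexEn
}

searchd
{
    # clients connect here over the MySQL protocol
    listen = 9312:mysql
    # agent/API traffic between the nodes
    listen = 9315
}
```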

-- crashed SphinxQL request dump ---  
SELECT id   FROM RewardTranslationIndexEn WHERE  MATCH('@gpc_keyword _61983_')  AND   
ANY(gallery_partner_catalog_id)=61983  AND is_searchable=1  GROUP BY reward_group_id   
WITHIN GROUP ORDER BY total_cost ASC  ORDER BY name ASC  LIMIT 0, 40  OPTION max_matches=150000,  
 field_weights=(model=10, brand=10, name=6, description=2) ; SHOW META; SELECT groupby()   
AS category_facets__t2level1, count(*)    FROM RewardTranslationIndexEn  WHERE  MATCH('@gpc_keyword   
_61983_')  AND ANY(gallery_partner_catalog_id)=61983  AND is_searchable=1  GROUP BY   
category_facets.catalog_6407._t2level1  ORDER BY category_facets.catalog_6407._t2level1   
ASC  LIMIT 0, 2000   OPTION max_matches=150000, field_weights=(model=10, brand=10,  
 name=6, description=2) ;SELECT groupby() AS category_facets__t2level2, count(*)    
  FROM RewardTranslationIndexEn  WHERE  MATCH('@gpc_keyword _61983_')  AND ANY(gallery_partner_catalog_id)=61983   
 AND is_searchable=1  GROUP BY category_facets.catalog_6407._t2level2  ORDER BY category_facets.catalog_6407._t2level2   
ASC  LIMIT 0, 2000   OPTION max_matches=150000, field_weights=(model=10, brand=10,  
 name=6, description=2) ;SELECT groupby() AS category_facets__t2level3, count(*)    
  FROM RewardTranslationIndexEn  WHERE  MATCH('@gpc_keyword _61983_')  AND ANY(gallery_partner_catalog_id)=61983   
 AND is_searchable=1  GROUP BY category_facets.catalog_6407._t2level3  ORDER BY category_facets.catalog_6407._t2level3   
ASC  LIMIT 0, 2000   OPTION max_matches=150000, field_weights=(model=10, brand=10,  
 name=6, description=2) ;SELECT groupby() AS category_facets__t2level4, count(*)    
  FROM RewardTranslationIndexEn  WHERE  MATCH('@gpc_keyword _61983_')  AND ANY(gallery_partner_catalog_id)=61983   
 AND is_searchable=1  GROUP BY category_facets.catalog_6407._t2level4  ORDER BY category_facets.catalog_6407._t2level4   
ASC  LIMIT 0, 2000   OPTION max_matches=150000, field_weights=(model=10, brand=10,  
 name=6, description=2) ;SELECT groupby() AS category_facets__t2level10, count(*)   
   FROM RewardTranslationIndexEn  WHERE  MATCH('@gpc_keyword _61983_')  AND ANY(gallery_partner_catalog_id)=61983   
 AND is_searchable=1  GROUP BY category_facets.catalog_6407._t2level10  ORDER BY category_facets.catalog_6407._t2level10   
ASC  LIMIT 0, 2000   OPTION max_matches=150000, field_weights=(model=10, brand=10,  
 name=6, description=2) ;  
--- request dump end ---  
--- local index:  
Manticore 6.0.4 1a3a4ea82@230314 (columnar 2.0.4 5a49bd7@230306) (secondary 2.0.4 5a49bd7@230306)  
Handling signal 11  
-------------- backtrace begins here ---------------  
Program compiled with Clang 15.0.4  
Configured with flags: Configured with these definitions: -DDISTR_BUILD=focal -DUSE_SYSLOG=1 -DWITH_GALERA=1 -DWITH_RE2=1 -DWITH_RE2_FORCE_STATIC=1 -DWITH_STEMMER=1 -DWITH_STEMMER_FORCE_STATIC=1 -DWITH_ICU=1 -DWITH_ICU_FORCE_STATIC=1 -DWITH_SSL=1 -DWITH_ZLIB=1 -DWITH_ZSTD=1 -DDL_ZSTD=1 -DZSTD_LIB=libzstd.so.1 -DWITH_CURL=1 -DDL_CURL=1 -DCURL_LIB=libcurl.so.4 -DWITH_ODBC=1 -DDL_ODBC=1 -DODBC_LIB=libodbc.so.2 -DWITH_EXPAT=1 -DDL_EXPAT=1 -DEXPAT_LIB=libexpat.so.1 -DWITH_ICONV=1 -DWITH_MYSQL=1 -DDL_MYSQL=1 -DMYSQL_LIB=libmysqlclient.so.21 -DWITH_POSTGRESQL=1 -DDL_POSTGRESQL=1 -DPOSTGRESQL_LIB=libpq.so.5 -DLOCALDATADIR=/var/lib/manticore/data -DFULL_SHARE_DIR=/usr/share/manticore  
Built on Linux aarch64 for Linux x86_64 (focal)  
Stack bottom = 0x7faa50043fa0, thread stack size = 0x20000  
Trying manual backtrace:  
Something wrong with thread stack, manual backtrace may be incorrect (fp=0x1)  
Wrong stack limit or frame pointer, manual backtrace failed (fp=0x1, stack=0x7faa50040000, stacksize=0x20000)  
Trying system backtrace:  
begin of system symbols:  
/usr/bin/searchd(_Z12sphBacktraceib 0x22a)[0x5570c4b9e16a]  
/usr/bin/searchd(_ZN11CrashLogger11HandleCrashEi 0x355)[0x5570c4a61715]  
/lib/x86_64-linux-gnu/libpthread.so.0( 0x14420)[0x7faa6522e420]  
/usr/bin/searchd(_ZN14QueueCreator_c20ReplaceJsonWithExprsER24CSphMatchComparatorStateRN3sph8Vector_TI15ExtraSortExpr_tNS2_13DefaultCopy_TIS4_EENS2_14DefaultRelimitENS2_16DefaultStorage_TIS4_EEEE 0x270)[0x5570c4baf630]  
/usr/bin/searchd(_ZN14QueueCreator_c10RemapAttrsER24CSphMatchComparatorStateRN3sph8Vector_TI15ExtraSortExpr_tNS2_13DefaultCopy_TIS4_EENS2_14DefaultRelimitENS2_16DefaultStorage_TIS4_EEEE 0x48)[0x5570c4baf9a8]  
/usr/bin/searchd(_ZN14QueueCreator_c21SetupGroupSortingFuncEb 0x316)[0x5570c4baff76]  
/usr/bin/searchd(_ZN14QueueCreator_c15SetGroupSortingEv 0x2e)[0x5570c4bb047e]  
/usr/bin/searchd(_Z14sphCreateQueueRK18SphQueueSettings_tRK9CSphQueryR10CSphStringR13SphQueueRes_tPN3sph8Vector_TIS5_NS9_13DefaultCopy_TIS5_EENS9_14DefaultRelimitENS9_16DefaultStorage_TIS5_EEEEP14QueryProfile_c 0x55)[0x5570c4bb1db5]  
/usr/bin/searchd(_Z18MinimizeAggrResultR12AggrResult_tRK9CSphQuerybRKN3sph9StringSetEP14QueryProfile_cPK18CSphFilterSettingsbb 0x2af6)[0x5570c4a70f96]  
/usr/bin/searchd(_ZN15SearchHandler_c9RunSubsetEii 0x1229)[0x5570c4a79649]  
/usr/bin/searchd(_ZN15SearchHandler_c10RunQueriesEv 0xd4)[0x5570c4a751e4]  
/usr/bin/searchd(_Z17HandleMysqlSelectR11RowBuffer_iR15SearchHandler_c 0x1ec)[0x5570c4a9a95c]  
/usr/bin/searchd(_Z20HandleMysqlMultiStmtRKN3sph8Vector_TI9SqlStmt_tNS_13DefaultCopy_TIS1_EENS_14DefaultRelimitENS_16DefaultStorage_TIS1_EEEER19CSphQueryResultMetaR11RowBuffer_iRK10CSphString 0x3d0)[0x5570c4a9dd40]  
/usr/bin/searchd(_ZN15ClientSession_c7ExecuteESt4pairIPKciER11RowBuffer_i 0x822)[0x5570c4aa87f2]  
/usr/bin/searchd(_Z20ProcessSqlQueryBuddySt4pairIPKciERhR16ISphOutputBuffer 0x47)[0x5570c4a0d507]  
/usr/bin/searchd(_Z8SqlServeSt10unique_ptrI16AsyncNetBuffer_cSt14default_deleteIS0_EE 0x10b8)[0x5570c49f99c8]  
/usr/bin/searchd(_Z10MultiServeSt10unique_ptrI16AsyncNetBuffer_cSt14default_deleteIS0_EESt4pairIitE7Proto_e 0x43)[0x5570c49f5853]  
/usr/bin/searchd( 0x7883e4)[0x5570c49f63e4]  
/usr/bin/searchd(_ZZN7Threads11CoRoutine_c13CreateContextESt8functionIFvvEE11VecTraits_TIhEENUlN5boost7context6detail10transfer_tEE_8__invokeES9_ 0x1c)[0x5570c536a19c]  
/usr/bin/searchd(make_fcontext 0x37)[0x5570c5388ec7]  
Trying boost backtrace:  
 0# sphBacktrace(int, bool) in /usr/bin/searchd  
 1# CrashLogger::HandleCrash(int) in /usr/bin/searchd  
 2# 0x00007FAA6522E420 in /lib/x86_64-linux-gnu/libpthread.so.0  
 3# QueueCreator_c::ReplaceJsonWithExprs(CSphMatchComparatorState&, sph::Vector_T<ExtraSortExpr_t, sph::DefaultCopy_T<ExtraSortExpr_t>, sph::DefaultRelimit, sph::DefaultStorage_T<ExtraSortExpr_t> >&) in /usr/bin/searchd  
 4# QueueCreator_c::RemapAttrs(CSphMatchComparatorState&, sph::Vector_T<ExtraSortExpr_t, sph::DefaultCopy_T<ExtraSortExpr_t>, sph::DefaultRelimit, sph::DefaultStorage_T<ExtraSortExpr_t> >&) in /usr/bin/searchd  
 5# QueueCreator_c::SetupGroupSortingFunc(bool) in /usr/bin/searchd  
 6# QueueCreator_c::SetGroupSorting() in /usr/bin/searchd  
 7# sphCreateQueue(SphQueueSettings_t const&, CSphQuery const&, CSphString&, SphQueueRes_t&, sph::Vector_T<CSphString, sph::DefaultCopy_T<CSphString>, sph::DefaultRelimit, sph::DefaultStorage_T<CSphString> >*, QueryProfile_c*) in /usr/bin/searchd  
 8# MinimizeAggrResult(AggrResult_t&, CSphQuery const&, bool, sph::StringSet const&, QueryProfile_c*, CSphFilterSettings const*, bool, bool) in /usr/bin/searchd  
 9# SearchHandler_c::RunSubset(int, int) in /usr/bin/searchd  
10# SearchHandler_c::RunQueries() in /usr/bin/searchd  
11# HandleMysqlSelect(RowBuffer_i&, SearchHandler_c&) in /usr/bin/searchd  
12# HandleMysqlMultiStmt(sph::Vector_T<SqlStmt_t, sph::DefaultCopy_T<SqlStmt_t>, sph::DefaultRelimit, sph::DefaultStorage_T<SqlStmt_t> > const&, CSphQueryResultMeta&, RowBuffer_i&, CSphString const&) in /usr/bin/searchd  
13# ClientSession_c::Execute(std::pair<char const*, int>, RowBuffer_i&) in /usr/bin/searchd  
14# ProcessSqlQueryBuddy(std::pair<char const*, int>, unsigned char&, ISphOutputBuffer&) in /usr/bin/searchd  
15# SqlServe(std::unique_ptr<AsyncNetBuffer_c, std::default_delete<AsyncNetBuffer_c> >) in /usr/bin/searchd  
16# MultiServe(std::unique_ptr<AsyncNetBuffer_c, std::default_delete<AsyncNetBuffer_c> >, std::pair<int, unsigned short>, Proto_e) in /usr/bin/searchd  
17# 0x00005570C49F63E4 in /usr/bin/searchd  
18# Threads::CoRoutine_c::CreateContext(std::function<void ()>, VecTraits_T<unsigned char>)::{lambda(boost::context::detail::transfer_t)#1}::__invoke(boost::context::detail::transfer_t) in /usr/bin/searchd  
19# make_fcontext in /usr/bin/searchd  
  
-------------- backtrace ends here ---------------  
Please, create a bug report in our bug tracker (https://github.com/manticoresoftware/manticore/issues)  
and attach there:  
a) searchd log, b) searchd binary, c) searchd symbols.  
Look into the chapter 'Reporting bugs' in the manual  
(https://manual.manticoresearch.com/Reporting_bugs)  
Dump with GDB via watchdog  
--- active threads ---  
thd 0 (work_1), proto mysql, state query, command select  
--- Totally 2 threads, and 1 client-working threads ---  
------- CRASH DUMP END -------  
@flakone2010 (Author)

Additional information.

I removed the ORDER BY clause from the query and it started working.

Does this mean there could be bad data somewhere that's causing it to crash? Hopefully the crash dump above points you in the right direction.
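
To illustrate, here is the pattern boiled down from the last query in the dump above (simplified; not the exact production statements):

```sql
-- this shape crashes the primary while it merges the agents' results:
SELECT groupby() AS category_facets__t2level10, count(*)
FROM RewardTranslationIndexEn
GROUP BY category_facets.catalog_6407._t2level10
ORDER BY category_facets.catalog_6407._t2level10 ASC
LIMIT 0, 2000;

-- with the ORDER BY on the JSON attribute removed, it completes fine:
SELECT groupby() AS category_facets__t2level10, count(*)
FROM RewardTranslationIndexEn
GROUP BY category_facets.catalog_6407._t2level10
LIMIT 0, 2000;
```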

Thank You

@tomatolog (Contributor)

Could you provide the remote index with its configs, along with the master index config, so we can reproduce this crash locally and investigate?

githubmanticore added the 'waiting' label Jul 24, 2023
@flakone2010 (Author)

Unfortunately, I am not authorized to upload our indexes. I did try to recreate the issue in our UAT environment by processing the catalog's data there, but was unable to reproduce it. For now, moving the sorting into the application will have to do.

I will close the ticket as nothing further can be diagnosed or provided.

@tomatolog (Contributor)

Could you check every local index behind the distributed RewardTranslationIndexEn with indextool (indextool -c your.conf --check local_name) to make sure the crash is not caused by a bad index?

Could you also check whether the crash still persists with the recent 6.2.0 release in your UAT environment?


flakone2010 commented Aug 10, 2023

I previously ran that as well without issue. Here is the output for one of the four nodes; each one passed successfully.

```
indextool -c /etc/manticoresearch/manticore.conf --check RewardTranslationIndexEn
Manticore 6.0.4 1a3a4ea@230314
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Copyright (c) 2017-2023, Manticore Software LTD (https://manticoresearch.com)

using config file '/etc/manticoresearch/manticore.conf'...
WARNING: secondary library not loaded; secondary index(es) disabled
checking table 'RewardTranslationIndexEn'...
checking schema...
checking dictionary...
checking data...
checking rows...
checking attribute blocks index...
checking kill-list...
checking docstore...
checking dead row map...
checking doc-id lookup...
check passed, 18.3 sec elapsed
```

Unfortunately, as for updating our UAT environment: we can't recreate the issue there even after importing the data into it, so updating to the latest dev build won't help us here. It is difficult to recreate the issue; otherwise I would have created a pared-down index for you to test against.

I will make one general observation. In the past we have only ever had issues with queries when JSON data is being filtered on (like in this instance). They were usually minor in nature, but they caused Manticore to crash, and we always just worked around them. The last such issue was that a JSON object key starting with a number would cause a crash; prefixing the object key with an underscore '_' fixed it, e.g. "category_facets.catalog_6407._48938dhd3872d". In the current instance, however, the values are readable ASCII strings, so there is nothing special in the data being sorted on. I even forced the JSON key/array data structures to match sequentially, in case some optimization happening in the background was causing an out-of-bounds access.
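
To make that earlier workaround concrete, the key shapes looked roughly like this (the values here are made up; only the key naming matches our data):

```sql
-- A JSON key beginning with a digit used to crash when grouped or filtered on:
--   category_facets = {"catalog_6407": {"48938dhd3872d": "..."}}
-- Prefixing the key with an underscore at index time avoided the crash:
--   category_facets = {"catalog_6407": {"_48938dhd3872d": "..."}}

SELECT groupby() AS facet, count(*)
FROM RewardTranslationIndexEn
GROUP BY category_facets.catalog_6407._48938dhd3872d;
```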

Maybe the issue lives in the code that takes the JSON data and starts to sort on it. Perhaps there is some optimization making assumptions about how the JSON structure is accessed? I just don't know.

If you have any other questions, please let me know. In any case, we look forward to your next release.

@tomatolog (Contributor)

The issue with JSON numeric keys was fixed in 6dd3964 and is part of the 6.2.0 release.

However, you need to reindex your data after upgrading the package to get the fix.

As for the crash you reported here, I also tried but was unable to recreate it on simplified data. Please reopen the issue if you see more crashes or have a reproducible example that you could share.


flakone2010 commented Aug 14, 2023

So we tried upgrading our production instances to 6.2.0 on Friday evening, and our crashes got worse. The query itself was no longer crashing the server, but the search instances were running out of memory about once an hour. Here are some server logs. We also do not see anything in the search logs because the kernel is killing the process, so a core dump is never triggered. The log below is for a single node. If you can think of anything for this, let us know; for now we have had to roll back to Manticore 6.0.4.

Aug 14 00:14:45 grs-manticore-primary4 kernel: [180285.151768] [1130973]   117 1130973    13456       67    94208      545             0 searchd
Aug 14 00:14:45 grs-manticore-primary4 kernel: [180285.151770] [1130974]   117 1130974 12720149  5892465 53063680     6064             0 searchd
Aug 14 00:14:45 grs-manticore-primary4 kernel: [180285.151772] [1130981]   117 1130981    11516      460   126976     3449             0 manticore-execu
Aug 14 00:14:45 grs-manticore-primary4 kernel: [180285.151773] [1169122]     0 1169122      652       87    45056       21             0 sh
Aug 14 00:14:45 grs-manticore-primary4 kernel: [180285.151775] [1169123]     0 1169123     1972      509    57344      113             0 sudo
Aug 14 00:14:45 grs-manticore-primary4 kernel: [180285.151776] [1169128]   117 1169128   367395    23321  2949120   329686             0 indexer
Aug 14 00:14:45 grs-manticore-primary4 kernel: [180285.151778] [1173350]   118 1173350     9609      571    57344       44             0 pickup
Aug 14 00:14:45 grs-manticore-primary4 kernel: [180285.151779] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/manticore.service,task=searchd,pid=1130974,uid=117
Aug 14 00:14:45 grs-manticore-primary4 kernel: [180285.151801] Out of memory: Killed process 1130974 (searchd) total-vm:50880596kB, anon-rss:23569024kB, file-rss:836kB, shmem-rss:0kB, UID:117 pgtables:51820kB oom_score_adj:0
Aug 14 00:14:50 grs-manticore-primary4 searchd[1173854]: [Mon Aug 14 00:14:50.953 2023] [1173854] using config file '/etc/manticoresearch/manticore.conf' (71590 chars)...
Aug 14 00:14:51 grs-manticore-primary4 searchd[1173854]: [Mon Aug 14 00:14:50.993 2023] [1173854] FATAL: stop: kill() on pid 1130974 failed: No such process
Aug 14 00:14:51 grs-manticore-primary4 searchd[1173854]: Manticore 6.2.0 45680f95d@230804 (columnar 2.2.0 dc33868@230804) (secondary 2.2.0 dc33868@230804)
Aug 14 00:14:51 grs-manticore-primary4 searchd[1173854]: Copyright (c) 2001-2016, Andrew Aksyonoff
Aug 14 00:14:51 grs-manticore-primary4 searchd[1173854]: Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com/)
Aug 14 00:14:51 grs-manticore-primary4 searchd[1173854]: Copyright (c) 2017-2023, Manticore Software LTD (https://manticoresearch.com/)
Aug 14 00:14:51 grs-manticore-primary4 systemd[1]: manticore.service: Control process exited, code=exited, status=1/FAILURE
Aug 14 00:14:51 grs-manticore-primary4 systemd[1]: manticore.service: Failed with result 'exit-code'.
Aug 14 00:14:51 grs-manticore-primary4 systemd[1]: manticore.service: Scheduled restart job, restart counter is at 1.
Aug 14 00:14:51 grs-manticore-primary4 systemd[1]: Stopped Manticore Search Engine.
Aug 14 00:14:51 grs-manticore-primary4 systemd[1]: Starting Manticore Search Engine...
[root@grs-manticore-primary4 ~]# du -sh /var/lib/manticore/data/indexes/
6.7G	/var/lib/manticore/data/indexes/
[root@grs-manticore-primary4 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          24007        6284        8808           1        8914       17321
Swap:          1953           0        1953

Thank You,

flakone2010 reopened this Aug 14, 2023
@sanikolaev (Collaborator)

> The query itself was not crashing it, but the search instances were running out of memory about once an hour

Does it happen while the select query from the original post is being executed?

Anyway, the easiest way to solve this issue seems to be for you to upload your index files, config, and the query that causes the problem to our write-only S3, so we can reproduce and inspect it locally.
