Crash indexer on morphology line in config #395

Closed
cappadaan opened this issue Aug 18, 2020 · 24 comments
@cappadaan

cappadaan commented Aug 18, 2020

Manticore Search version: 3.5.0
OS: CentOS 7

I am trying to move from Sphinx to Manticore.
This specific line in my Sphinx (3.2.1) config causes a crash on the Manticore indexer:

morphology = libstemmer_dutch, lemmatize_en

Report

*** Oops, indexer crashed! Please send the following report to developers.
Manticore 3.5.0 1d34c49@200722 release
-------------- report begins here ---------------
Current document: docid=23196, hits=1135108
Current batch: minid=225047, maxid=225211
Hit pool start: docid=0, hit=0
-------------- backtrace begins here ---------------
Program compiled with 4.8.5
Configured with flags: Configured by CMake with these definitions: -DCMAKE_BUILD_TYPE=RelWithDebInfo -DDISTR_BUILD=rhel7 -DUSE_SSL=ON -DDL_UNIXODBC=1 -DUNIXODBC_LIB=libodbc.so.2 -DDL_EXPAT=1 -DEXPAT_LIB=libexpat.so.1 -DUSE_LIBICONV=1 -DDL_MYSQL=1 -DMYSQL_LIB=libmysqlclient.so.18 -DDL_PGSQL=1 -DPGSQL_LIB=libpq.so.5 -DLOCALDATADIR=/var/data -DFULL_SHARE_DIR=/usr/share/manticore -DUSE_RE2=1 -DUSE_ICU=1 -DUSE_BISON=ON -DUSE_FLEX=ON -DUSE_SYSLOG=1 -DWITH_EXPAT=1 -DWITH_ICONV=ON -DWITH_MYSQL=1 -DWITH_ODBC=ON -DWITH_PGSQL=1 -DWITH_RE2=1 -DWITH_STEMMER=1 -DWITH_ZLIB=ON -DGALERA_SONAME=libgalera_manticore.so.31 -DSYSCONFDIR=/etc/manticoresearch
Host OS is Linux runner-fa6cab46-project-3858465-concurrent-0 4.19.78-coreos #1 SMP Mon Oct 14 22:56:39 -00 2019 x86_64 x86_64 x86_64 GNU/Linux
Stack bottom = 0x0, thread stack size = 0x20000
Trying system backtrace:
begin of system symbols:
indexer(_Z12sphBacktraceib+0x90)[0x614e80]
indexer(_Z7sigsegvi+0xa2)[0x55c1c2]
/lib64/libpthread.so.0(+0xf630)[0x7fd0794a7630]
/lib64/libc.so.6(+0x13ee07)[0x7fd0783cbe07]
indexer(_ZN14CSphHitBuilder7cidxHitEP16CSphAggregateHit+0x37e)[0x5792ae]
indexer(_ZN13CSphIndex_VLN5BuildERKN3sph8Vector_TIP10CSphSourceNS0_13DefaultCopy_TIS3_EENS0_14DefaultRelimitENS0_16DefaultStorage_TIS3_EEEEii+0x22e4)[0x5bd1a4]
indexer(_Z7DoIndexRK17CSphConfigSectionPKcRK15CSphOrderedHashIS_10CSphString15CSphStrHashFuncLi256EEbP8_IO_FILE+0x15cb)[0x56303b]
indexer(main+0x13b7)[0x55a787]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fd0782af555]
indexer[0x55c05f]

@cappadaan cappadaan changed the title Crash on morphology line in config Crash indexer on morphology line in config Aug 18, 2020
@cappadaan
Author

Changing the line to

morphology = libstemmer_nl, lemmatize_en

has the same result (crash)

@sanikolaev
Collaborator

Hi

Can you provide more details on how to reproduce the crash as I can't reproduce it like this:

snikolaev@dev:~$ cat dutch.conf
common
{
    lemmatizer_base =  .
}

source dutch
{
    type = mysql
    sql_host = localhost
    sql_user = test
    sql_pass =
    sql_db = test
    sql_attr_uint = attr
    sql_query = select 1, 'De wind waait door de hoge bomen' body, 1 attr
}

index idx
{
    path = dutch
    source = dutch
    morphology = libstemmer_dutch, lemmatize_en
}

searchd
{
    listen = 9314
    log = sphinx.log
    pid_file = 9314.pid
}
snikolaev@dev:~$ indexer -c dutch.conf --all
Manticore 3.5.0 1d34c49@200722 release
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Copyright (c) 2017-2020, Manticore Software LTD (http://manticoresearch.com)

using config file 'dutch.conf'...
indexing index 'idx'...
collected 1 docs, 0.0 MB
creating lookup: 0.0 Kdocs, 100.0% done
creating histograms: 0.0 Kdocs, 100.0% done
sorted 0.0 Mhits, 100.0% done
total 1 docs, 32 bytes
total 0.043 sec, 736 bytes/sec, 23.01 docs/sec
total 18 reads, 0.000 sec, 92.1 kb/call avg, 0.0 msec/call avg
total 13 writes, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg
snikolaev@dev:~$ indextool -c dutch.conf --dumpdict idx
Manticore 3.5.0 1d34c49@200722 release
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Copyright (c) 2017-2020, Manticore Software LTD (http://manticoresearch.com)

using config file 'dutch.conf'...
dumping dictionary for index 'idx'...
keyword,docs,hits,offset
bom,1,1,6
de,1,2,11
dor,1,1,26
hog,1,1,16
waait,1,1,21
wind,1,1,1

@sanikolaev sanikolaev added the waiting Waiting for the original poster (in most cases) or something else label Aug 19, 2020
@cappadaan
Author

It seems more complicated than my first post.

I just uploaded the config (please replace the PHP vars manually) + the DB data (run sql) + stopwords

If I remove any of the lines:

stopwords = sphinx_stopwords.txt
morphology = libstemmer_nl, lemmatize_en
index_exact_words = 1

the indexer crashes.

@adriannuta
Contributor

From what I see, it is specifically morphology = libstemmer_nl, lemmatize_en that creates the condition for the crash (using the en stemmer instead of the lemmatizer is fine), together with some particularity in the data (not sure whether it is specific words or the 4-byte chars in it).
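
For readers looking for a stopgap on an unpatched build, the observation above suggests swapping the English lemmatizer for the English stemmer. A minimal config sketch, assuming the rest of the index definition stays unchanged; note that `stem_en` and `lemmatize_en` normalize words differently, so search results may not be identical:

```
index idx
{
    path = dutch
    source = dutch
    # assumption: stem_en used as a temporary substitute for lemmatize_en
    morphology = libstemmer_nl, stem_en
}
```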

@adriannuta adriannuta added bug and removed waiting Waiting for the original poster (in most cases) or something else labels Aug 19, 2020
@cappadaan
Author

Is there a workaround to use this morphology without a crash?

@githubmanticore
Contributor

➤ Aleksey N. Vinogradov commented:

try this one:

--- src/sphinx.cpp	(revision b6beb3b4c4f469b019a807054a517bb29305db0d) 
+++ src/sphinx.cpp	(date 1597832282079) 
@@ -9298,6 +9298,15 @@ 
 	m_dSkiplist.Resize ( 0 ); 
 } 
  
+static int strcmpp (const char* l, const char* r) 
+{ 
+	const char* szEmpty = ""; 
+	if ( !l ) 
+		l = szEmpty; 
+	if ( !r ) 
+		r = szEmpty; 
+	return strcmp ( l, r ); 
+} 
  
 void CSphHitBuilder::cidxHit ( CSphAggregateHit * pHit ) 
 { 
@@ -9309,7 +9318,7 @@ 
 	// next word 
 	///////////// 
  
-	const bool bNextWord = ( m_tLastHit.m_uWordID!=pHit->m_uWordID ||	( m_pDict->GetSettings().m_bWordDict && strcmp ( (char*)m_tLastHit.m_sKeyword, (char*)pHit->m_sKeyword ) ) ); // OPTIMIZE? 
+	const bool bNextWord = ( m_tLastHit.m_uWordID!=pHit->m_uWordID ||	( m_pDict->GetSettings().m_bWordDict && strcmpp ( (char*)m_tLastHit.m_sKeyword, (char*)pHit->m_sKeyword ) ) ); // OPTIMIZE? 
 	const bool bNextDoc = bNextWord || ( m_tLastHit.m_tRowID!=pHit->m_tRowID ); 
  
 	if ( m_bGotFieldEnd && ( bNextWord || bNextDoc ) ) 
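
For readers following along, here is a standalone sketch of what the patch above guards against. Passing a null pointer to `strcmp` is undefined behavior and typically segfaults inside libc, which would match the `libc.so.6` frame directly above `cidxHit` in the backtrace; that one of the keyword pointers was null is an inference from the patch, not something the thread states. The names `NullSafeStrcmp` and `KeywordChanged` are hypothetical and not part of the Manticore source.

```cpp
#include <cassert>
#include <cstring>

// Null-safe wrapper around strcmp: a null pointer is treated as an empty string.
// Same idea as the strcmpp helper in the patch; renamed to mark it as a sketch.
static int NullSafeStrcmp(const char* l, const char* r)
{
    const char* szEmpty = "";
    if (!l)
        l = szEmpty;
    if (!r)
        r = szEmpty;
    return std::strcmp(l, r);
}

// Hypothetical stand-in for the keyword half of the bNextWord check:
// returns true when the current keyword differs from the previous one.
static bool KeywordChanged(const char* sLast, const char* sCurrent)
{
    return NullSafeStrcmp(sLast, sCurrent) != 0;
}
```

With this wrapper, `KeywordChanged(nullptr, "bom")` returns true and `KeywordChanged(nullptr, nullptr)` returns false, whereas a plain `strcmp` call on the same arguments would be undefined.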
 

@adriannuta
Contributor

Seems to work, not getting the crash anymore.

@githubmanticore
Contributor

➤ Aleksey N. Vinogradov commented:

Actually, it doesn't. The produced index has no full-text part; only full-scan filtering by 'title' is possible.
The reason is that the source provides no 'id' column.

@klirichek
Contributor

@cappadaan you have to fix your config/sources anyway.
Data from the source must contain 'id' as its first column, of integer type. That is the document ID, and it is mandatory (at least for now).
Without that column the index is useless, even if the indexer didn't crash.
Read https://manual.manticoresearch.com/Adding_data_from_external_storages/Fetching_from_databases/Indexing_fetched_data#Indexing-fetched-data for details.
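
To illustrate the requirement above, a minimal source sketch in which the first selected column is the integer document ID. The source name, table name and column names are hypothetical; the connection settings simply mirror the reproduction config earlier in the thread:

```
source example_src
{
    type = mysql
    sql_host = localhost
    sql_user = test
    sql_pass =
    sql_db = test
    # first column: integer document ID; the remaining columns become fields
    sql_query = SELECT id, title, body FROM documents
}
```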

@cappadaan
Author

I already have this. This config is just a completely stripped-down one to show the crash. Adding an ID does not fix this.

@klirichek
Contributor

Well, ok. Let's split the issues then.

  1. On your data, the crash is eliminated by the patch above.
  2. However, the index produced by the patched indexer is useless, and I assume that is because your test data contains no 'id'. So the invalid index is not a consequence of the crash or the fix, but of the id-less example.

@cappadaan
Author

Is this the final patch? Will it be included in the next version, so my setup does not break when updating?

@githubmanticore
Contributor

➤ Aleksey N. Vinogradov commented:

If @adriannuta or you confirm that it fixes the problem, then yes; I see no reason to change anything in it.

@cappadaan
Author

I'm using yum, so I have no idea how to check whether this patch works.

@klirichek
Contributor

Well, ok, I've just pushed 59d94ce with the fix.
You can check our 'nightly' repo. The CI pipeline usually takes about half an hour to complete (it will build the package and put it in the repo).

klirichek added a commit that referenced this issue Aug 19, 2020
That should most probably fix github #395
@cappadaan
Author

I still have no idea how to use this. Can you be more specific about how?

@klirichek
Contributor

Here it is: https://manual.manticoresearch.com/Installation/RHEL_and_Centos#Installing-Manticore-packages-on-RedHat-and-CentOS

The current 'dev' repo just contains builds from the master branch. You can pick the latest rpm with this fix from that repo.

@klirichek
Contributor

@cappadaan
Author

Got it working, thx!

I can confirm the patch works. I just indexed the full Sphinx 3.2.1 config without a crash.

@cappadaan
Author

In the indexer it says

"Copyright (c) 2017-2020, Manticore Software LTD (http://manticoresearch.com)"

should be

"Copyright (c) 2017-2020, Manticore Software LTD (https://manticoresearch.com)"

@githubmanticore
Contributor

➤ Aleksey N. Vinogradov commented:

Good point! Thank you!

@cappadaan
Author

cappadaan commented Aug 19, 2020

Got all sorts of new, unknown warnings (never saw these in Sphinx) while trying to start searchd:

[13:42.085] [25205] using config file '/home/bla.conf' (22783 chars)...
[13:42.086] [25205] WARNING: TCP fast open unavailable (can't read /proc/sys/net/ipv4/tcp_fastopen, unsupported kernel?)
listening on all interfaces for mysql, port=2473

and

[Wed Aug 19 19:18:30.435 2020] [25217] WARNING: internal error: non-empty queue on a rotation cycle start, got 1 elements
[Wed Aug 19 19:18:30.435 2020] [25217] WARNING: queue[] = template_index

@githubmanticore
Contributor

➤ Aleksey N. Vinogradov commented:

The first one is just a notice for you. We support TCP Fast Open. If you enable it, master-agent communication will improve, and so will connections over HTTP if the client supports it. Everything is ready on our side; you only need to enable support at the system level. So you're welcome to use it, or you can safely ignore the warning: it is an opportunity, not an error.

The second, I guess, is just a matter of how you wrote the config. We have a dedicated index 'type = template', which has no source and no files and is just a container to inherit from (it can also be used to generate snippets, since they need no real index). You use a kind of 'template_index' instead, which is a usual plain index but incomplete (no data source, no storage). It works in general, but this warning will always be issued for it, since the daemon has no way to tell whether it is incomplete by accident or is just a 'brick' for building other indexes.
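
As a rough sketch of the 'type = template' approach described above: the shared settings live in a template index, and the real index inherits from it and overrides the type. The index names and the path are hypothetical; the settings are the ones mentioned earlier in this thread.

```
index base_settings
{
    # container for shared settings only: no source, no files
    type = template
    morphology = libstemmer_nl, lemmatize_en
    index_exact_words = 1
}

index real_index : base_settings
{
    # inherited settings apply; type must be overridden for a real index
    type = plain
    source = dutch
    path = /var/lib/manticore/real_index
}
```

Because a child inherits every setting of its parent, including `type`, the child has to set `type = plain` explicitly.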

@sanikolaev
Collaborator

I'm closing the issue as the crash is fixed. Feel free to reopen if it makes sense.
