Misaligned weight() values across cluster nodes #1111
Comments
Could you copy the index files from the nodes where the values differ so that we can test them? Could you also check that the disk chunk count is the same on every node? |
Config:
Single Thread:
|
Could you run these queries on all nodes and provide the result sets from every node?
It seems the index holds different data on different nodes; however, it is not clear how it got into that state or what the difference is. |
Node #1
Node #2
Node #3
|
@tomatolog sorry for the delay |
Document weight is a function of multiple variables, including IDF, which depends on the number of documents in the table, disk chunk, or RAM chunk. For an RT index, IDF is calculated separately for each disk chunk, so even without replication the same query against the same documents can give different weights, and even a different document order, depending on how the documents are distributed across the chunks:
mysql> drop table if exists t; create table t(f text); insert into t(f) values('a b c'),('a b'),('a a a a a'); flush ramchunk t; select *, weight() from t where match('a');
--------------
drop table if exists t
--------------
Query OK, 0 rows affected (0.03 sec)
--------------
create table t(f text)
--------------
Query OK, 0 rows affected (0.00 sec)
--------------
insert into t(f) values('a b c'),('a b'),('a a a a a')
--------------
Query OK, 3 rows affected (0.01 sec)
--------------
flush ramchunk t
--------------
Query OK, 0 rows affected (0.01 sec)
--------------
select *, weight() from t where match('a')
--------------
+---------------------+-----------+----------+
| id | f | weight() |
+---------------------+-----------+----------+
| 1515364055206854670 | a b c | 1319 |
| 1515364055206854671 | a b | 1319 |
| 1515364055206854672 | a a a a a | 1180 |
+---------------------+-----------+----------+
3 rows in set (0.01 sec)
mysql> drop table if exists t; create table t(f text); insert into t(f) values('a b c'); flush ramchunk t; insert into t(f) values('a b'); flush ramchunk t; insert into t(f) values('a a a a a'); flush ramchunk t; select *, weight() from t where match('a');
--------------
drop table if exists t
--------------
Query OK, 0 rows affected (0.01 sec)
--------------
create table t(f text)
--------------
Query OK, 0 rows affected (0.01 sec)
--------------
insert into t(f) values('a b c')
--------------
Query OK, 1 row affected (0.00 sec)
--------------
flush ramchunk t
--------------
Query OK, 0 rows affected (0.01 sec)
--------------
insert into t(f) values('a b')
--------------
Query OK, 1 row affected (0.00 sec)
--------------
flush ramchunk t
--------------
Query OK, 0 rows affected (0.01 sec)
--------------
insert into t(f) values('a a a a a')
--------------
Query OK, 1 row affected (0.00 sec)
--------------
flush ramchunk t
--------------
Query OK, 0 rows affected (0.00 sec)
--------------
select *, weight() from t where match('a')
--------------
+---------------------+-----------+----------+
| id | f | weight() |
+---------------------+-----------+----------+
| 1515364055206854675 | a a a a a | 1819 |
| 1515364055206854673 | a b c | 1680 |
| 1515364055206854674 | a b | 1680 |
+---------------------+-----------+----------+
3 rows in set (0.00 sec)
We should consider integrating https://manual.manticoresearch.com/Creating_a_table/NLP_and_tokenization/Low-level_tokenization#global_idf into RT indexes. |
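The effect shown in the two sessions above can be sketched in a few lines of Python. This is only an illustration: the `idf` function below uses a generic BM25-style formula chosen for the example, not Manticore's actual ranker, and the helper name `per_chunk_idf` is hypothetical. It shows that the same three documents yield a different IDF for the term `a` depending on whether they sit in one chunk or in three.

```python
import math


def idf(total_docs: int, docs_with_term: int) -> float:
    """Generic BM25-style IDF (an assumption for illustration, not Manticore's formula)."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


def per_chunk_idf(chunks, term):
    """Compute the IDF of `term` independently inside each chunk.

    `chunks` is a list of chunks; each chunk is a list of document strings.
    """
    result = []
    for docs in chunks:
        # Document frequency: how many documents in this chunk contain the term.
        df = sum(1 for d in docs if term in d.split())
        result.append(idf(len(docs), df))
    return result


docs = ["a b c", "a b", "a a a a a"]
one_chunk = [docs]                   # all documents flushed together
three_chunks = [[d] for d in docs]   # one flush per document

print(per_chunk_idf(one_chunk, "a"))     # a single IDF value for the whole chunk
print(per_chunk_idf(three_chunks, "a"))  # three equal values, different from the one above
```

Under this simplified formula the per-chunk statistics (1 document, 1 match) give a larger IDF than the whole-table statistics (3 documents, 3 matches), so any weight built on IDF changes with the chunk layout even though the data is identical.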
There's also local_df, but it doesn't work for RT tables:
It looks most promising to make it work for RT and perhaps make it a default or expose it as |
The local_df issue was solved in #1436. The global_idf integration is still to be discussed and estimated. |
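A rough sketch of why aggregating statistics helps, assuming local_df amounts to summing document frequencies over all local parts before IDF is computed. The IDF formula and the helper name `summed_idf` are assumptions made for this example and are not taken from Manticore's source.

```python
import math


def idf(total_docs: int, docs_with_term: int) -> float:
    """Generic BM25-style IDF (illustrative assumption, not Manticore's formula)."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


def summed_idf(chunks, term):
    """Sum document counts and document frequencies over all chunks, then compute one IDF."""
    total = sum(len(docs) for docs in chunks)
    df = sum(1 for docs in chunks for d in docs if term in d.split())
    return idf(total, df)


docs = ["a b c", "a b", "a a a a a"]
layout_one = [docs]                 # everything in a single chunk
layout_three = [[d] for d in docs]  # one document per chunk

# Both layouts see 3 documents in total, 2 of which contain "b".
print(summed_idf(layout_one, "b") == summed_idf(layout_three, "b"))  # True
```

Because the totals are the same no matter how the documents are split, the resulting IDF no longer depends on the chunk layout.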
Steps to reproduce the global_idf issue (Manticore 6.3.0):
-- note the idf values in the text_features column
-- note that all idf values in text_features are zero.
-- the path to global_idf is not even stored in the index, so after a restart of |
I cannot reproduce the issue you described. I need a complete example that I can recreate locally to investigate it.
then I used half of the data for the actual table and the other half for the global_idf table, with some words matching between them
then I indexed the data and created the global_idf file
then, for queries with global idf enabled (q1111-1.zip), I see correct idf values
I need a complete example, with the global idf file and the source data, that I can run locally to investigate the issue, because for me everything works fine now. |
Hi, I've just returned from vacation and will prepare a response in a few days. But this seems strange to me: in Manticore 6.3.0 the path to the global_idf file is not stored anywhere in the index at all, so the table simply does not know that it should use it. There is no global_idf option in the create table statement in the gh1111.zip file. Your example uses the indexer tool, so you are using plain tables. But my example deals with real-time tables! |
You could change my example to get a reproducible case, or create your own that shows the issue. For now I see no issue, and I have shown you that everything works as intended. |
But my comments here from 31 July contain exact steps to reproduce the problem. Pay attention to p.6: the problem there is with a real-time table! As for '/path/to/global.idf', it may be built from any real-time table, even the baseline one from p.2, since the problem is that the real-time table ignores the documented real-time table creation option global_idf = '/path/to/global.idf' and does not store this path anywhere in the real-time table. Your example does not cover my case at all, since it uses indexer.exe and hence plain tables, not a real-time table. |
That is why I asked you for a complete case: I tried to reproduce it (with both RT indexes and plain indexes) and saw no issue. I posted the plain index case because it has all the data to start with and a simpler setup. If you see the issue with RT indexes, please post all the files along with the commands, or maybe a Docker container, that I could run locally to see the issue. |
The only caveat is that I tried the case while the daemon was running in plain mode and all setup was done via the config. Maybe your case relates only to RT mode, but I ask you to provide a complete case with all files or commands so that we can continue the investigation. |
OK, a bit later. |
So, here is an improved version of the steps to reproduce the global_idf problem. The problem concerns only real-time tables.
(Note: no global.idf file is used; real-time table creation)
(no global.idf in baseline requests)
(Note the reasonable idf and bm25 values in the text_features column) |
global_idf.zip
(indextool --dumpdict products --stats (with the table name) just didn't work, so the undocumented filename option is used. That is another problem in indextool to solve. indextool also didn't work when there was any other distributed table in manticore.json; that is a third problem) |
(Note: a smaller table than in p.1; the global.idf file is specified)
(Note: the global_idf option is specified in the requests)
(Note: all idf values in text_features are zero, and the bm25 values are erroneous and differ significantly from the baseline) |
(Note: the path to global.idf is not even stored in the index, so after a restart of searchd we'll get exactly the same bad results with zero idf.) |
I've just fixed the global_idf issues for RT indexes in 1611667. You need to recreate your index to get the fix. I also fixed indextool to work with
If you have any issues with the usage of global_idf or indextool, please open new issues, as this thread is very long and many of the cases are not directly related to the initial description. |
Many thanks! |
Describe the bug
We have a cluster consisting of 3 nodes, and I have been running the same command on each node. However, I have noticed that I am getting different scores as results. Could you please provide an explanation for this phenomenon?
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Expected idempotent results across all cluster nodes.
Describe the environment:
bin/searchd -v or bin/indexer -v:
uname -a (if on a Unix-like system): Linux manticore-03.dmetrics.internal 5.4.0-1097-aws #105~18.04.1-Ubuntu SMP Mon Feb 13 17:50:57 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux