Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Misaligned weight() values across cluster nodes #1111

Closed
pavelnemirovsky opened this issue May 5, 2023 · 22 comments
Closed

Misaligned weight() values across cluster nodes #1111

pavelnemirovsky opened this issue May 5, 2023 · 22 comments
Assignees
Labels

Comments

@pavelnemirovsky
Copy link

pavelnemirovsky commented May 5, 2023

Describe the bug
We have a cluster consisting of 3 nodes, and I have been running the same command on each node. However, I have noticed that I am getting different scores as results. Could you please provide an explanation for this phenomenon?

To Reproduce
Steps to reproduce the behavior:

# mysql -h manticore-01.dmetrics.internal -P 9306  
mysql> select min(weight()) as min_score, max(weight()) as max_score, count(*) from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200;  
 ----------- ----------- ----------   
| min_score | max_score | count(*) |  
 ----------- ----------- ----------   
|      1602 |      1820 |    35274 |  
 ----------- ----------- ----------   
1 row in set (0.14 sec)  
  
# mysql -h manticore-02.dmetrics.internal -P 9306  
  
mysql> select min(weight()) as min_score, max(weight()) as max_score, count(*) from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200;  
 ----------- ----------- ----------   
| min_score | max_score | count(*) |  
 ----------- ----------- ----------   
|      1604 |      1855 |    35274 |  
 ----------- ----------- ----------   
1 row in set (0.14 sec)  
  
# mysql -h manticore-03.dmetrics.internal -P 9306  
mysql> select min(weight()) as min_score, max(weight()) as max_score, count(*) from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200;  
 ----------- ----------- ----------   
| min_score | max_score | count(*) |  
 ----------- ----------- ----------   
|      1602 |      1752 |    35274 |  
 ----------- ----------- ----------   
1 row in set (0.14 sec)  

Expected behavior
Expected idempotent results across all cluster nodes.

Describe the environment:

  • Manticore Search version (top line in output of bin/searchd -v or bin/indexer -v):
Manticore 6.0.4 1a3a4ea82@230314 (columnar 2.0.4 5a49bd7@230306) (secondary 2.0.4 5a49bd7@230306)  
Copyright (c) 2001-2016, Andrew Aksyonoff  
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)  
Copyright (c) 2017-2023, Manticore Software LTD (https://manticoresearch.com)  
@tomatolog
Copy link
Contributor

could you copy index files from nodes where values differ for testing?

Could you check that disk chunks count the same?
Could you issue your query without pseudo sharding \ option threads = 1 to check the result ?

@sanikolaev sanikolaev added the waiting Waiting for the original poster (in most cases) or something else label May 5, 2023
@pavelnemirovsky
Copy link
Author

@tomatolog

Config:

# https://github.com/manticoresoftware/manticoresearch/blob/master/manual/Server_settings/Searchd.md#node_address
common {
    # https://manual.manticoresearch.com/Server_settings/Common#lemmatizer_base
    lemmatizer_base = /usr/share/manticore/nlp/

    # https://manual.manticoresearch.com/Server_settings/Common#progressive_merge
    # progressive_merge =

    # https://manual.manticoresearch.com/Server_settings/Common#json_autoconv_keynames
    # json_autoconv_keynames =

    # https://manual.manticoresearch.com/Server_settings/Common#json_autoconv_numbers
    # json_autoconv_numbers = 0

    # https://manual.manticoresearch.com/Server_settings/Common#on_json_attr_error
    # on_json_attr_error = ignore_attr

    # plugin_dir =
}

searchd {
  server_id = manticore-03.dmetrics.internal
  listen = 127.0.0.1:9306:mysql
  listen = 127.0.0.1:9308:http
  listen = 10.0.82.150:9312
  listen = 10.0.82.150:9306:mysql
  listen = 10.0.82.150:9308:http
  listen = 10.0.82.150:9360-9370:replication
  pid_file = /var/run/manticore/searchd.pid
  max_packet_size = 128M
  binlog_flush = 1 # ultimate safety, low speed
  query_log_format = sphinxql
  mysql_version_string = 5.0.37
  data_dir = /var/lib/data/manticore
  binlog_path = /var/lib/data/manticore
  log = /var/log/manticore/searchd.log
  query_log = /var/log/manticore/query.log
  attr_flush_period = 60
  # Flushing RT RAM chunks each 5 min
  rt_flush_period = 300
  preopen_indexes = 0
  not_terms_only_allowed = 1
}

Single Thread:

# mysql -h manticore-01.dmetrics.internal -P 9306
mysql> select min(weight()) as min_score, max(weight()) as max_score, count(*) from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 option threads = 1;
+-----------+-----------+----------+
| min_score | max_score | count(*) |
+-----------+-----------+----------+
|      1603 |      1769 |    36392 |
+-----------+-----------+----------+
1 row in set (0.15 sec)

mysql> SHOW INDEX fgi_prod STATUS;
+-----------------------------+------------------------------------------------------------------------------------------------------------+
| Variable_name               | Value                                                                                                      |
+-----------------------------+------------------------------------------------------------------------------------------------------------+
| index_type                  | rt                                                                                                         |
| indexed_documents           | 14139585                                                                                                   |
| indexed_bytes               | 74543841680                                                                                                |
| ram_bytes                   | 19916679448                                                                                                |
| disk_bytes                  | 90059646941                                                                                                |
| disk_mapped                 | 23217024968                                                                                                |
| disk_mapped_cached          | 18921156608                                                                                                |
| disk_mapped_doclists        | 0                                                                                                          |
| disk_mapped_cached_doclists | 0                                                                                                          |
| disk_mapped_hitlists        | 0                                                                                                          |
| disk_mapped_cached_hitlists | 0                                                                                                          |
| killed_documents            | 0                                                                                                          |
| killed_rate                 | 0.00%                                                                                                      |
| ram_chunk                   | 995385488                                                                                                  |
| ram_chunk_segments_count    | 28                                                                                                         |
| disk_chunks                 | 32                                                                                                         |
| mem_limit                   | 2147483648                                                                                                 |
| mem_limit_rate              | 93.72%                                                                                                     |
| ram_bytes_retired           | 0                                                                                                          |
| tid                         | 145999                                                                                                     |
| tid_saved                   | 145999                                                                                                     |
| query_time_1min             | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}                                   |
| query_time_5min             | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}                                   |
| query_time_15min            | {"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001}     |
| query_time_total            | {"queries":33512, "avg_sec":0.078, "min_sec":0.001, "max_sec":0.161, "pct95_sec":0.112, "pct99_sec":0.126} |
| found_rows_1min             | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}                                   |
| found_rows_5min             | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}                                   |
| found_rows_15min            | {"queries":1, "avg":1, "min":1, "max":1, "pct95":1, "pct99":1}                                             |
| found_rows_total            | {"queries":33512, "avg":98644, "min":0, "max":192349, "pct95":141003, "pct99":156775}                      |
+-----------------------------+------------------------------------------------------------------------------------------------------------+

# mysql -h manticore-02.dmetrics.internal -P 9306
mysql> select min(weight()) as min_score, max(weight()) as max_score, count(*) from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 option threads = 1;
+-----------+-----------+----------+
| min_score | max_score | count(*) |
+-----------+-----------+----------+
|      1604 |      1773 |    36392 |
+-----------+-----------+----------+
1 row in set (0.15 sec)

mysql> SHOW INDEX fgi_prod STATUS;
+-----------------------------+------------------------------------------------------------------------------------------------------------+
| Variable_name               | Value                                                                                                      |
+-----------------------------+------------------------------------------------------------------------------------------------------------+
| index_type                  | rt                                                                                                         |
| indexed_documents           | 14139585                                                                                                   |
| indexed_bytes               | 74543841680                                                                                                |
| ram_bytes                   | 19832797718                                                                                                |
| disk_bytes                  | 90049699488                                                                                                |
| disk_mapped                 | 23232896729                                                                                                |
| disk_mapped_cached          | 18917761024                                                                                                |
| disk_mapped_doclists        | 0                                                                                                          |
| disk_mapped_cached_doclists | 0                                                                                                          |
| disk_mapped_hitlists        | 0                                                                                                          |
| disk_mapped_cached_hitlists | 0                                                                                                          |
| killed_documents            | 0                                                                                                          |
| killed_rate                 | 0.00%                                                                                                      |
| ram_chunk                   | 914899342                                                                                                  |
| ram_chunk_segments_count    | 28                                                                                                         |
| disk_chunks                 | 32                                                                                                         |
| mem_limit                   | 2147483648                                                                                                 |
| mem_limit_rate              | 95.00%                                                                                                     |
| ram_bytes_retired           | 0                                                                                                          |
| tid                         | 145999                                                                                                     |
| tid_saved                   | 145999                                                                                                     |
| query_time_1min             | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}                                   |
| query_time_5min             | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}                                   |
| query_time_15min            | {"queries":1, "avg_sec":0.001, "min_sec":0.001, "max_sec":0.001, "pct95_sec":0.001, "pct99_sec":0.001}     |
| query_time_total            | {"queries":32779, "avg_sec":0.087, "min_sec":0.001, "max_sec":0.294, "pct95_sec":0.154, "pct99_sec":0.185} |
| found_rows_1min             | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}                                   |
| found_rows_5min             | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}                                   |
| found_rows_15min            | {"queries":1, "avg":1, "min":1, "max":1, "pct95":1, "pct99":1}                                             |
| found_rows_total            | {"queries":32779, "avg":99523, "min":0, "max":205374, "pct95":141003, "pct99":192349}                      |
+-----------------------------+------------------------------------------------------------------------------------------------------------+

# mysql -h manticore-03.dmetrics.internal -P 9306
mysql> select min(weight()) as min_score, max(weight()) as max_score, count(*) from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 option threads = 1;
+-----------+-----------+----------+
| min_score | max_score | count(*) |
+-----------+-----------+----------+
|      1602 |      1782 |    36392 |
+-----------+-----------+----------+
1 row in set (0.15 sec)

+-----------------------------+------------------------------------------------------------------------------------------------------------+
| Variable_name               | Value                                                                                                      |
+-----------------------------+------------------------------------------------------------------------------------------------------------+
| index_type                  | rt                                                                                                         |
| indexed_documents           | 14139586                                                                                                   |
| indexed_bytes               | 74543841680                                                                                                |
| ram_bytes                   | 18465132812                                                                                                |
| disk_bytes                  | 90025432598                                                                                                |
| disk_mapped                 | 23290852358                                                                                                |
| disk_mapped_cached          | 17833938944                                                                                                |
| disk_mapped_doclists        | 0                                                                                                          |
| disk_mapped_cached_doclists | 0                                                                                                          |
| disk_mapped_hitlists        | 0                                                                                                          |
| disk_mapped_cached_hitlists | 0                                                                                                          |
| killed_documents            | 0                                                                                                          |
| killed_rate                 | 0.00%                                                                                                      |
| ram_chunk                   | 631056516                                                                                                  |
| ram_chunk_segments_count    | 29                                                                                                         |
| disk_chunks                 | 32                                                                                                         |
| mem_limit                   | 2147483648                                                                                                 |
| mem_limit_rate              | 93.18%                                                                                                     |
| ram_bytes_retired           | 0                                                                                                          |
| tid                         | 145999                                                                                                     |
| tid_saved                   | 145999                                                                                                     |
| query_time_1min             | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}                                   |
| query_time_5min             | {"queries":1, "avg_sec":0.002, "min_sec":0.002, "max_sec":0.002, "pct95_sec":0.002, "pct99_sec":0.002}     |
| query_time_15min            | {"queries":2, "avg_sec":0.002, "min_sec":0.002, "max_sec":0.002, "pct95_sec":0.002, "pct99_sec":0.002}     |
| query_time_total            | {"queries":33076, "avg_sec":0.079, "min_sec":0.001, "max_sec":0.213, "pct95_sec":0.117, "pct99_sec":0.130} |
| found_rows_1min             | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}                                   |
| found_rows_5min             | {"queries":1, "avg":1, "min":1, "max":1, "pct95":1, "pct99":1}                                             |
| found_rows_15min            | {"queries":2, "avg":1, "min":1, "max":1, "pct95":1, "pct99":1}                                             |
| found_rows_total            | {"queries":33076, "avg":97746, "min":0, "max":192349, "pct95":141003, "pct99":156775}                      |
+-----------------------------+------------------------------------------------------------------------------------------------------------+

@tomatolog
Copy link
Contributor

could you issue these queries at all nodes and provide results sets from every node?

select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1603;
select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1769;
select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1604;
select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1773;
select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1602;
select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1782;

seems your index got different data however it is not clear how it got into such state and what is the difference.

@pavelnemirovsky
Copy link
Author

@tomatolog

Node #1

root@manticore-01:~# mysql -h0 -P9306

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1603;
+---------------------+-------+
| id                  | score |
+---------------------+-------+
| 2592984993446342688 |  1603 |
| 4662220671500427390 |  1603 |
| 3129968330644052828 |  1603 |
| 3300479849082479870 |  1603 |
| 4265425771834879270 |  1603 |
| 6161236734625651487 |  1603 |
| 2972218032951538573 |  1603 |
| 2230410452443263952 |  1603 |
| 5426145563682917845 |  1603 |
| 5443436492395652225 |  1603 |
| 8123348241364266641 |  1603 |
| 1209174545758837824 |  1603 |
| 2717219431518028807 |  1603 |
| 1585252769924152895 |  1603 |
| 5552902210274204647 |  1603 |
| 1980593527404316223 |  1603 |
| 1481372055290767673 |  1603 |
| 7427455941233620045 |  1603 |
| 2506307317936132088 |  1603 |
| 8752126551063260322 |  1603 |
+---------------------+-------+
20 rows in set (0.02 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1769;
Empty set (0.01 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1604;
+---------------------+-------+
| id                  | score |
+---------------------+-------+
| 8200229092940547929 |  1604 |
| 2120942744610469628 |  1604 |
| 6923035898495548477 |  1604 |
| 6601064453919589115 |  1604 |
| 6735606320642119811 |  1604 |
| 2999792835472411662 |  1604 |
| 3654574537062196888 |  1604 |
| 3773264351158118416 |  1604 |
| 8637578311057562584 |  1604 |
| 8794164784460149655 |  1604 |
| 6059877165305848918 |  1604 |
| 3282906016831256748 |  1604 |
| 2612261472663055387 |  1604 |
| 2690877269429919795 |  1604 |
| 2847097192829816046 |  1604 |
| 5942945371882851678 |  1604 |
| 2474220990740776572 |  1604 |
|  733163232415140441 |  1604 |
| 2513753592768196340 |  1604 |
| 8773490989255665275 |  1604 |
+---------------------+-------+
20 rows in set (0.00 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1773;
Empty set (0.00 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1602;
+---------------------+-------+
| id                  | score |
+---------------------+-------+
| 4873968605738893745 |  1602 |
| 2171153676825932229 |  1602 |
| 3622786119719300160 |  1602 |
| 3049536713905632928 |  1602 |
| 6107640006935923479 |  1602 |
|  801981398053586315 |  1602 |
| 1960698561719059790 |  1602 |
| 1343761795668256013 |  1602 |
| 4622984669984555319 |  1602 |
| 1545080798186495944 |  1602 |
| 2098591267684858063 |  1602 |
| 4497645853033140285 |  1602 |
| 8946551953652792236 |  1602 |
| 3248487345969809604 |  1602 |
| 2919116455063626548 |  1602 |
| 8727190087137721890 |  1602 |
| 2513347835042179078 |  1602 |
| 5636863017215536826 |  1602 |
| 3172480916545321926 |  1602 |
| 7111135492437432234 |  1602 |
+---------------------+-------+
20 rows in set (0.00 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1782;
Empty set (0.00 sec)

Node #2

root@manticore-02:~# mysql -h0 -P9306

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1603;
Empty set (0.01 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1769;
Empty set (0.01 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1604;
Empty set (0.01 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1773;
Empty set (0.00 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1602;
+---------------------+-------+
| id                  | score |
+---------------------+-------+
| 8892694543726138219 |  1602 |
|  629335885271522399 |  1602 |
| 1456549821258029774 |  1602 |
| 7249953765864106116 |  1602 |
| 2786707208645515768 |  1602 |
| 5261557492145362518 |  1602 |
| 2833376159662097650 |  1602 |
| 8595731973147110735 |  1602 |
| 2253945765483571809 |  1602 |
| 4997605205840460661 |  1602 |
| 4319783924984481239 |  1602 |
| 4315258191512044085 |  1602 |
|  371245259292705244 |  1602 |
| 4189238752045049628 |  1602 |
| 4177900041508324215 |  1602 |
| 5487895812895329158 |  1602 |
| 1083832472836028683 |  1602 |
| 2432922389897552564 |  1602 |
| 1665163043726892822 |  1602 |
| 1831702418817682063 |  1602 |
+---------------------+-------+
20 rows in set (0.00 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1782;
Empty set (0.00 sec)

Node #3

root@manticore-03:~# mysql -h0 -P9306

mysql> ¬select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1603;
ERROR 1064 (42000): sphinxql: syntax error, unexpected $end near '¬select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1603'
mysql>
mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1603;
+---------------------+-------+
| id                  | score |
+---------------------+-------+
| 2592984993446342688 |  1603 |
| 4662220671500427390 |  1603 |
| 3129968330644052828 |  1603 |
| 3300479849082479870 |  1603 |
| 4265425771834879270 |  1603 |
| 6161236734625651487 |  1603 |
| 2972218032951538573 |  1603 |
| 2230410452443263952 |  1603 |
| 5426145563682917845 |  1603 |
| 5443436492395652225 |  1603 |
| 8123348241364266641 |  1603 |
| 1209174545758837824 |  1603 |
| 2717219431518028807 |  1603 |
| 1585252769924152895 |  1603 |
| 5552902210274204647 |  1603 |
| 1980593527404316223 |  1603 |
| 1481372055290767673 |  1603 |
| 7427455941233620045 |  1603 |
| 2506307317936132088 |  1603 |
| 8752126551063260322 |  1603 |
+---------------------+-------+
20 rows in set (0.01 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1769;
Empty set (0.01 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1604;
+---------------------+-------+
| id                  | score |
+---------------------+-------+
| 3278299437838253517 |  1604 |
|  831363264692743711 |  1604 |
| 4097074020702646877 |  1604 |
|  502386133080193092 |  1604 |
| 2438302740824218206 |  1604 |
| 2434101175870422770 |  1604 |
| 6705524661416092133 |  1604 |
| 1567232811339767193 |  1604 |
| 3999944616016432641 |  1604 |
| 1306726073512346320 |  1604 |
| 5297381330670067881 |  1604 |
| 9017267096864895454 |  1604 |
| 8994162228783625519 |  1604 |
| 7889636414790414500 |  1604 |
| 4636895361539966838 |  1604 |
| 8006576869797910500 |  1604 |
|  430197624393949370 |  1604 |
| 3386716024388158741 |  1604 |
| 4022663738521285142 |  1604 |
| 3264531810452662099 |  1604 |
+---------------------+-------+
20 rows in set (0.00 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1773;
Empty set (0.01 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1602;
+---------------------+-------+
| id                  | score |
+---------------------+-------+
| 4873968605738893745 |  1602 |
| 2171153676825932229 |  1602 |
| 3622786119719300160 |  1602 |
| 3049536713905632928 |  1602 |
| 6107640006935923479 |  1602 |
|  801981398053586315 |  1602 |
| 1960698561719059790 |  1602 |
| 1343761795668256013 |  1602 |
| 4622984669984555319 |  1602 |
| 1545080798186495944 |  1602 |
| 2098591267684858063 |  1602 |
| 4497645853033140285 |  1602 |
| 8946551953652792236 |  1602 |
| 3248487345969809604 |  1602 |
| 2919116455063626548 |  1602 |
| 8727190087137721890 |  1602 |
| 2513347835042179078 |  1602 |
| 5636863017215536826 |  1602 |
| 3172480916545321926 |  1602 |
| 7111135492437432234 |  1602 |
+---------------------+-------+
20 rows in set (0.01 sec)

mysql> select id, weight() as score from fgi_prod where match('cnn') and publish_date between 1680469200 and 1683061200 and score=1782;
Empty set (0.01 sec)

@pavelnemirovsky
Copy link
Author

@tomatolog sorry for delay

@sanikolaev
Copy link
Collaborator

sanikolaev commented May 27, 2023

Document weight is a function of multiple variables including IDF which depends on the number of documents in the table/disk chunk/ram chunk and in case of an RT index it's calculated separately for each disk chunk, so even without replication the same query against the same documents can give different weights and even different documents order depending on the distribution of the documents in the chunks:

mysql> drop table if exists t; create table t(f text); insert into t(f) values('a b c'),('a b'),('a a a a a'); flush ramchunk t; select *, weight() from t where match('a');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.03 sec)

--------------
create table t(f text)
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
insert into t(f) values('a b c'),('a b'),('a a a a a')
--------------

Query OK, 3 rows affected (0.01 sec)

--------------
flush ramchunk t
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
select *, weight() from t where match('a')
--------------

+---------------------+-----------+----------+
| id                  | f         | weight() |
+---------------------+-----------+----------+
| 1515364055206854670 | a b c     |     1319 |
| 1515364055206854671 | a b       |     1319 |
| 1515364055206854672 | a a a a a |     1180 |
+---------------------+-----------+----------+
3 rows in set (0.01 sec)

mysql> drop table if exists t; create table t(f text); insert into t(f) values('a b c'); flush ramchunk t; insert into t(f) values('a b'); flush ramchunk t; insert into t(f) values('a a a a a'); flush ramchunk t; select *, weight() from t where match('a');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
create table t(f text)
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
insert into t(f) values('a b c')
--------------

Query OK, 1 row affected (0.00 sec)

--------------
flush ramchunk t
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
insert into t(f) values('a b')
--------------

Query OK, 1 row affected (0.00 sec)

--------------
flush ramchunk t
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
insert into t(f) values('a a a a a')
--------------

Query OK, 1 row affected (0.00 sec)

--------------
flush ramchunk t
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
select *, weight() from t where match('a')
--------------

+---------------------+-----------+----------+
| id                  | f         | weight() |
+---------------------+-----------+----------+
| 1515364055206854675 | a a a a a |     1819 |
| 1515364055206854673 | a b c     |     1680 |
| 1515364055206854674 | a b       |     1680 |
+---------------------+-----------+----------+
3 rows in set (0.00 sec)

We should consider integration of https://manual.manticoresearch.com/Creating_a_table/NLP_and_tokenization/Low-level_tokenization#global_idf into RT indexes.

@githubmanticore githubmanticore changed the title Misaligned weight() values across cluster nodes Misaligned weight() values across cluster nodes May 27, 2023
@githubmanticore githubmanticore removed the waiting Waiting for the original poster (in most cases) or something else label May 27, 2023
@sanikolaev
Copy link
Collaborator

There's also local_df and it's not working for RT tables:

mysql> drop table if exists t; create table t(f text); insert into t(f) values('a b c'); flush ramchunk t; insert into t(f) values('a b'); flush ramchunk t; insert into t(f) values('a a a a a'); flush ramchunk t; select *, weight() from t where match('a'); select *, weight() from t where match('a') option local_df=1; optimize table t option sync=1, cutoff=1; select *, weight() from t where match('a');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.04 sec)

--------------
create table t(f text)
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
insert into t(f) values('a b c')
--------------

Query OK, 1 row affected (0.00 sec)

--------------
flush ramchunk t
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
insert into t(f) values('a b')
--------------

Query OK, 1 row affected (0.00 sec)

--------------
flush ramchunk t
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
insert into t(f) values('a a a a a')
--------------

Query OK, 1 row affected (0.00 sec)

--------------
flush ramchunk t
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
select *, weight() from t where match('a')
--------------

+---------------------+-----------+----------+
| id                  | f         | weight() |
+---------------------+-----------+----------+
| 5838818623640043544 | a a a a a |     1819 |
| 5838818623640043542 | a b c     |     1680 |
| 5838818623640043543 | a b       |     1680 |
+---------------------+-----------+----------+
3 rows in set (0.00 sec)

--------------
select *, weight() from t where match('a') option local_df=1
--------------

+---------------------+-----------+----------+
| id                  | f         | weight() |
+---------------------+-----------+----------+
| 5838818623640043544 | a a a a a |     1819 |
| 5838818623640043542 | a b c     |     1680 |
| 5838818623640043543 | a b       |     1680 |
+---------------------+-----------+----------+
3 rows in set (0.00 sec)

--------------
optimize table t option sync=1, cutoff=1
--------------

Query OK, 0 rows affected (0.16 sec)

--------------
select *, weight() from t where match('a')
--------------

+---------------------+-----------+----------+
| id                  | f         | weight() |
+---------------------+-----------+----------+
| 5838818623640043542 | a b c     |     1319 |
| 5838818623640043543 | a b       |     1319 |
| 5838818623640043544 | a a a a a |     1180 |
+---------------------+-----------+----------+
3 rows in set (0.00 sec)

It looks most promising to make it work for RT and perhaps make it a default or expose it as searchd.local_df.

@sanikolaev
Copy link
Collaborator

The local_df issue was solved here #1436

The global_idf is to be discussed and estimated.

@alexiv1965
Copy link

alexiv1965 commented Jul 31, 2024

Steps to reproduce global_idf issue (Manticore v 6.3.0):

  1. Prepare global.idf file from some working index by indextool. The only requirement is - that should be index, different from following test index, but with known keywords with idf values (which should differ to idf of keywords from test index). For the following test the only word 'bag' is needed with some non-zero idf value.
  2. Baseline:
CREATE TABLE products(title text, brand text) index_field_lengths='1' index_exact_words = '1' 
    morphology = 'lemmatize_ru_all,lemmatize_en_all' /*global_idf = '/path/to/global.idf'*/;
INSERT INTO products(title,brand) VALUES ('Crossbody Bag with Tassel', 'Burberry');
INSERT INTO products(title,brand) VALUES ('Some other bag', 'Gucci');
INSERT INTO products(title,brand) VALUES ('Шла собока по рояле', 'Ризеншнауцер');
INSERT INTO products(title,brand) VALUES ('Шлите апельсины', 'Марокко');

SELECT id, title, brand, weight() as score,
    packedfactors({no_atc=1, json=1}) as text_features
FROM products
WHERE MATCH('burberry')
LIMIT 0,200
OPTION
    max_matches=200,
    idf='plain,tfidf_unnormalized',
    /*global_idf=1,*/
    ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), /* any ranker with bm25 */
    max_query_time=600;

SELECT id, title, brand,
    weight() as score,
    packedfactors({no_atc=1, json=1}) as text_features
FROM products
WHERE MATCH('bag')
LIMIT 0,200
OPTION
    max_matches=200,
    idf='plain,tfidf_unnormalized',
    /*global_idf=1,*/
    ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), /* any ranker with bm25 */
    max_query_time=600;
  1. Baseline results (corresponding selects):
id      title   brand   score   text_features
1677721600007   Crossbody Bag with Tassel       Burberry        11843   {"bm25":891, "bm25a":0.75644487, "field_mask":2, "doc_word_count":1, "fields":[{"field":1, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.86135310, "min_idf":0.43067655, "max_idf":0.43067655, "sum_idf":0.43067655, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":1, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.43067655, "atc":0.000000}], "words":[{"tf":2, "idf":0.43067655}]}

id      title   brand   score   text_features
1677721600008   Some other bag  Gucci   2960    {"bm25":695, "bm25a":0.63685048, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.43067655, "min_idf":0.21533827, "max_idf":0.21533827, "sum_idf":0.21533827, "min_hit_pos":3, "min_best_span_pos":3, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.21533827, "atc":0.000000}], "words":[{"tf":2, "idf":0.21533827}]}                                                                                                                                      
1677721600007   Crossbody Bag with Tassel       Burberry        2631    {"bm25":695, "bm25a":0.62822247, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.43067655, "min_idf":0.21533827, "max_idf":0.21533827, "sum_idf":0.21533827, "min_hit_pos":2, "min_best_span_pos":2, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.21533827, "atc":0.000000}], "words":[{"tf":2, "idf":0.21533827}]}

-- mention idf values in text_features column
5. DROP TABLE products;
6. Test:

CREATE TABLE products(title text, brand text) index_field_lengths='1' index_exact_words = '1' 
    morphology = 'lemmatize_ru_all,lemmatize_en_all' global_idf = '/path/to/global.idf';
INSERT INTO products(title,brand) VALUES ('Crossbody Bag with Tassel', 'Burberry');
INSERT INTO products(title,brand) VALUES ('Some other bag', 'Gucci');
INSERT INTO products(title,brand) VALUES ('Шла собока по рояле', 'Ризеншнауцер');
INSERT INTO products(title,brand) VALUES ('Шлите апельсины', 'Марокко');

SELECT id, title, brand, weight() as score,
    packedfactors({no_atc=1, json=1}) as text_features
FROM products
WHERE MATCH('burberry')
LIMIT 0,200
OPTION
    max_matches=200,
    idf='plain,tfidf_unnormalized',
    global_idf=1,
    ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), /* any ranker with bm25 */
    max_query_time=600;

SELECT id, title, brand, weight() as score,
    packedfactors({no_atc=1, json=1}) as text_features
FROM products
WHERE MATCH('bag')
LIMIT 0,200
OPTION
    max_matches=200,
    idf='plain,tfidf_unnormalized',
    global_idf=1,
    ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), /* any ranker with bm25 */
    max_query_time=600;
  1. Test results:
id      title   brand   score   text_features
1677721600011   Crossbody Bag with Tassel       Burberry        0       {"bm25":500, "bm25a":0.500000, "field_mask":2, "doc_word_count":1, "fields":[{"field":1, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.000000, "min_idf":0.000000, "max_idf":0.000000, "sum_idf":0.000000, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":1, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.000000, "atc":0.000000}], "words":[{"tf":2, "idf":0.000000}]}

id      title   brand   score   text_features
1677721600012   Some other bag  Gucci   0       {"bm25":500, "bm25a":0.500000, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.000000, "min_idf":0.000000, "max_idf":0.000000, "sum_idf":0.000000, "min_hit_pos":3, "min_best_span_pos":3, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.000000, "atc":0.000000}], "words":[{"tf":2, "idf":0.000000}]}
1677721600011   Crossbody Bag with Tassel       Burberry        0       {"bm25":500, "bm25a":0.500000, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.000000, "min_idf":0.000000, "max_idf":0.000000, "sum_idf":0.000000, "min_hit_pos":2, "min_best_span_pos":2, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.000000, "atc":0.000000}], "words":[{"tf":2, "idf":0.000000}]}

-- mention all idf values in text_features are zero.
8. Control:

show create table products;

CREATE TABLE products (
id bigint,
title text,
brand text
) index_exact_words='1' index_field_lengths='1' morphology='lemmatize_ru_all,lemmatize_en_all'

-- path to global_idf is not even stored in index, so, after restart of searchd we'll get absolutely the same results with zero idf.

@tomatolog
Copy link
Contributor

tomatolog commented Aug 29, 2024

I can not reproduce issue you described. I need complete example that I could recreate locally to investigate the issue.
I created test table that used for indexing and prepare global_idf gh1111.zip then posted data into MySQL

mysql -u test test < gh1111.sql

then use half of the data for actual table then other half data for global_idf table there some words matched

source src_gidf1
{
  type      = mysql
  	sql_host		= 127.0.0.1
	sql_user		= root
	sql_pass		= 
	sql_port		= 3306
	sql_db		= test

  sql_query     = SELECT * FROM products where id<20
}


index gidf1
{
    source      = src_gidf1
    path      = data/gidf1
    charset_table=russian, 0..9, english, _
    index_field_lengths = 1
    index_exact_words = 1
    morphology = lemmatize_ru_all,lemmatize_en_all
    global_idf = gidf2.idf
}

source src_gidf2
{
  type      = mysql
  	sql_host		= 127.0.0.1
	sql_user		= root
	sql_pass		= 
	sql_port		= 3306
	sql_db		= test

  sql_query     = SELECT * FROM products where id>20
}


index gidf2
{
    source      = src_gidf2
    path      = data/gidf2
    charset_table=russian, 0..9, english, _
    index_field_lengths = 1
    index_exact_words = 1
    morphology = lemmatize_ru_all,lemmatize_en_all
    global_idf = gidf2.idf
}

then indexed data and created global_idf

indexer.exe -c d1.conf gidf1 gidf2
indextool.exe -c d1.conf --dumpdict gidf1 --stats > gidf1_src.txt
indextool.exe -c d1.conf --dumpdict gidf2 --stats > gidf2_src.txt
indextool.exe -c d1.conf --buildidf gidf1_src.txt --out gidf1.idf
indextool.exe -c d1.conf --buildidf gidf2_src.txt --out gidf2.idf

then for queries with global idf enabled q1111-1.zip I see correct idf values

mysql -h 127.0.0.1 -P 9306 -vvv < q1111-1.sql
--------------
SELECT *, weight() as score, packedfactors({no_atc=1, json=1}) as text_features FROM gidf1 WHERE MATCH('burberry') LIMIT 0,200 OPTION idf='plain,tfidf_unnormalized', global_idf=1, ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))')
--------------

+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id   | title                     | brand    | title_len | brand_len | score | text_features                                                                                                                                                                                                                                                                                                                                                                                                              |
+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|   11 | Crossbody Bag with Tassel | Burberry | 4         | 1         |     0 | {"bm25":891, "bm25a":0.70031464, "field_mask":2, "doc_word_count":1, "fields":[{"field":1, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.86135310, "min_idf":0.43067655, "max_idf":0.43067655, "sum_idf":0.43067655, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":1, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.43067655, "atc":0.000000}], "words":[{"tf":2, "idf":0.43067655}]} |
+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
--- 1 out of 1 results in 0ms ---

--------------
SELECT *, weight() as score, packedfactors({no_atc=1, json=1}) as text_features FROM gidf2 WHERE MATCH('burberry') LIMIT 0,200 OPTION idf='plain,tfidf_unnormalized', global_idf=1, ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))')
--------------

Empty set (0.00 sec)
--- 0 out of 0 results in 0ms ---

--------------
SELECT *, weight() as score, packedfactors({no_atc=1, json=1}) as text_features FROM gidf1 WHERE MATCH('bag') LIMIT 0,200 OPTION idf='plain,tfidf_unnormalized', global_idf=1, ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))')
--------------

+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id   | title                     | brand    | title_len | brand_len | score | text_features                                                                                                                                                                                                                                                                                                                                                                                                              |
+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|   12 | Some other bag            | Gucci    | 3         | 1         |  1149 | {"bm25":632, "bm25a":0.57499516, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.29248124, "min_idf":0.14624062, "max_idf":0.14624062, "sum_idf":0.14624062, "min_hit_pos":3, "min_best_span_pos":3, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.14624062, "atc":0.000000}], "words":[{"tf":2, "idf":0.14624062}]} |
|   11 | Crossbody Bag with Tassel | Burberry | 4         | 1         |   946 | {"bm25":632, "bm25a":0.56801891, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.29248124, "min_idf":0.14624062, "max_idf":0.14624062, "sum_idf":0.14624062, "min_hit_pos":2, "min_best_span_pos":2, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.14624062, "atc":0.000000}], "words":[{"tf":2, "idf":0.14624062}]} |
+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)
--- 2 out of 2 results in 0ms ---

--------------
SELECT *, weight() as score, packedfactors({no_atc=1, json=1}) as text_features FROM gidf2 WHERE MATCH('bag') LIMIT 0,200 OPTION idf='plain,tfidf_unnormalized', global_idf=1, ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))')
--------------

+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id   | title                     | brand    | title_len | brand_len | score | text_features                                                                                                                                                                                                                                                                                                                                                                                                              |
+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|   22 | uome otheu bag            | Gucui    | 3         | 1         |  1149 | {"bm25":632, "bm25a":0.57499516, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.29248124, "min_idf":0.14624062, "max_idf":0.14624062, "sum_idf":0.14624062, "min_hit_pos":3, "min_best_span_pos":3, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.14624062, "atc":0.000000}], "words":[{"tf":2, "idf":0.14624062}]} |
|   21 | Crousbody Bag wuth Tausel | Burbeury | 4         | 1         |   946 | {"bm25":632, "bm25a":0.56801891, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.29248124, "min_idf":0.14624062, "max_idf":0.14624062, "sum_idf":0.14624062, "min_hit_pos":2, "min_best_span_pos":2, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.14624062, "atc":0.000000}], "words":[{"tf":2, "idf":0.14624062}]} |
+------+---------------------------+----------+-----------+-----------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)
--- 2 out of 2 results in 0ms ---

I need complete example with global idf file and source data that I could run locally and investigate issue as for me all works fine now

@tomatolog tomatolog added the waiting Waiting for the original poster (in most cases) or something else label Aug 29, 2024
@alexiv1965
Copy link

Hi, I'm just return from vacations, will prepare response in several days. But it's strange for me: in Manticore 6.3.0 the path to global_idf file is not stored anywhere in index at all, so, index table just do not know that it should use it. There is no global_idf option in create table of gh1111.zip file. The following your example uses indexing tool, so, you use plain tables. But my example deals with real-time tables!

@tomatolog
Copy link
Contributor

you could change my example to get the reproducible case or create your own that shows the issue. For now I see no issue and show you that all works as intend .

@alexiv1965
Copy link

alexiv1965 commented Sep 5, 2024

But in my comments here of 31 July there are exact steps to reproduce the problem. Pay attention on p.6: there the problem with real-time table! As to '/path/to/global.idf' - it may be built on any real-time table, even from baseline p.2. since the problem is that real-time table ignores documented real-time table creation option global_idf = '/path/to/global.idf' and do not stores this path anywhere in real-time table. Your example is not about my case completely, since it uses indexer.exe and hence - plain tables, not real-time table.

@tomatolog
Copy link
Contributor

that is why I asked you about complete case as I tried to reproduce the case (I tried the RT indexes and plain indexes) and see no issue. I posted plain indexes case as it has all data to start with and simper setup.

If you see the issue with RT indexes please post all files along with commands or maybe a Docker container that I could run locally and see the issue.

@tomatolog
Copy link
Contributor

the only thing that I tried the case while daemon works in the plain mode and all setup was done via config. Maybe your case related to only RT mode but I ask you to provide complete case with all files or commands to continue investigation.

@alexiv1965
Copy link

OK, a bit later.

@alexiv1965
Copy link

alexiv1965 commented Sep 11, 2024

So, improved version of global_idf problem reproduction steps. The problem concerns only to real-time tables.

  1. Baseline (to compare with problem):
CREATE TABLE products(title text, brand text) index_field_lengths='1' index_exact_words = '1' 
    morphology = 'lemmatize_ru_all,lemmatize_en_all' /*global_idf = '/tmp/global.idf'*/;
INSERT INTO products(title,brand) VALUES ('Crossbody Bag with Tassel', 'Burberry');
INSERT INTO products(title,brand) VALUES ('Some other bag', 'Gucci');
INSERT INTO products(title,brand) VALUES ('Шла собока по рояле', 'Ризеншнауцер');
INSERT INTO products(title,brand) VALUES ('Шлите апельсины', 'Марокко');
INSERT INTO products(title,brand) VALUES ('Another bag without Tassel', 'Burberry');
INSERT INTO products(title,brand) VALUES ('Шлите яблоки', 'Краснодар');
FLUSH RAMCHUNK products;

(Take note - no global.idf file is used, real-time table creation)

  1. Baseline check:
SELECT id, title, brand, weight() as score,
    packedfactors({no_atc=1, json=1}) as text_features
FROM products
WHERE MATCH('burberry')
LIMIT 0,200
OPTION
    max_matches=200,
    idf='plain,tfidf_unnormalized',
    /*global_idf=1,*/
    ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), /* any ranker with bm25 */
    max_query_time=600;

SELECT id, title, brand,
    weight() as score,
    packedfactors({no_atc=1, json=1}) as text_features
FROM products
WHERE MATCH('шли')
LIMIT 0,200
OPTION
    max_matches=200,
    idf='plain,tfidf_unnormalized',
    /*global_idf=1,*/
    ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), /* any ranker with bm25 */
    max_query_time=600;

(no global.idf in baseline requests)

  1. Baseline results:
id	title	brand	score	text_features
795472708505698309	Another bag without Tassel	Burberry	7762	{"bm25":756, "bm25a":0.66703403, "field_mask":2, "doc_word_count":1, "fields":[{"field":1, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.56457508, "min_idf":0.28228754, "max_idf":0.28228754, "sum_idf":0.28228754, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":1, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.28228754, "atc":0.000000}], "words":[{"tf":2, "idf":0.28228754}]}
795472708505698305	Crossbody Bag with Tassel	Burberry	7762	{"bm25":756, "bm25a":0.66703403, "field_mask":2, "doc_word_count":1, "fields":[{"field":1, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.56457508, "min_idf":0.28228754, "max_idf":0.28228754, "sum_idf":0.28228754, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":1, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.28228754, "atc":0.000000}], "words":[{"tf":2, "idf":0.28228754}]}

id	title	brand	score	text_features
795472708505698308	Шлите апельсины	Марокко	3025	{"bm25":628, "bm25a":0.62708396, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":1, "word_count":1, "tf_idf":0.24755955, "min_idf":0.24755955, "max_idf":0.24755955, "sum_idf":0.24755955, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.24755955, "atc":0.000000}], "words":[{"tf":1, "idf":0.24755955}]}
795472708505698310	Шлите яблоки	Краснодар	3025	{"bm25":628, "bm25a":0.62708396, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":1, "word_count":1, "tf_idf":0.24755955, "min_idf":0.24755955, "max_idf":0.24755955, "sum_idf":0.24755955, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.24755955, "atc":0.000000}], "words":[{"tf":1, "idf":0.24755955}]}
795472708505698307	Шла собока по рояле	Ризеншнауцер	2094	{"bm25":709, "bm25a":0.60401660, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":1, "word_count":1, "tf_idf":0.24755955, "min_idf":0.24755955, "max_idf":0.24755955, "sum_idf":0.24755955, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.24755955, "atc":0.000000}], "words":[{"tf":1, "idf":0.24755955}]}

(Take note on idf and bm25 reasonable values in text_features column)

@alexiv1965
Copy link

alexiv1965 commented Sep 11, 2024

global_idf.zip
4. Prepare global.idf file from baseline working index by indextool. The only requirement is - that should be index, different from following test index, but with known keywords with idf values (which should differ to idf of keywords from test index).

ssh manticore_server
cd /var/lib/manticore
indextool --dumpdict products/products.0.spi --stats > products.txt
indextool --buildidf products.txt --out global.idf

(indextool --dumpdict products --stats (with table name) just didn't work, so undocumented filename option is used. It's another problem in indextool to solve. And indextool also didn't work when there is any other distributed table in manticore.json - it's the third problem)

@alexiv1965
Copy link

  1. DROP TABLE products;

  2. Problem reproduction:

CREATE TABLE products(title text, brand text) index_field_lengths='1' index_exact_words = '1' 
    morphology = 'lemmatize_ru_all,lemmatize_en_all' global_idf = '/var/lib/manticore/global.idf';
INSERT INTO products(title,brand) VALUES ('Crossbody Bag with Tassel', 'Burberry');
INSERT INTO products(title,brand) VALUES ('Some other bag', 'Gucci');
INSERT INTO products(title,brand) VALUES ('Шла собока по рояле', 'Ризеншнауцер');
INSERT INTO products(title,brand) VALUES ('Шлите апельсины', 'Марокко');
FLUSH RAMCHUNK products;

(Take notice: smaller table, than in p.1, global.idf file is specified)

  1. Problem requests:
SELECT id, title, brand, weight() as score,
    packedfactors({no_atc=1, json=1}) as text_features
FROM products
WHERE MATCH('burberry')
LIMIT 0,200
OPTION
    max_matches=200,
    idf='plain,tfidf_unnormalized',
    global_idf=1,
    ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), /* any ranker with bm25 */
    max_query_time=600;

SELECT id, title, brand, weight() as score,
    packedfactors({no_atc=1, json=1}) as text_features
FROM products
WHERE MATCH('шли')
LIMIT 0,200
OPTION
    max_matches=200,
    idf='plain,tfidf_unnormalized',
    global_idf=1,
    ranker=expr('(20.0*(1000*bm25f(1.2,0.9999,{title=1,brand=2})-500.0))'), /* any ranker with bm25 */
    max_query_time=600;

(Take notice: global_idf option specified in requests)

  1. Problem results:
id	title	brand	score	text_features
795475224534450181	Crossbody Bag with Tassel	Burberry	0	{"bm25":500, "bm25a":0.500000, "field_mask":2, "doc_word_count":1, "fields":[{"field":1, "lcs":1, "hit_count":2, "word_count":1, "tf_idf":0.000000, "min_idf":0.000000, "max_idf":0.000000, "sum_idf":0.000000, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":1, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.000000, "atc":0.000000}], "words":[{"tf":2, "idf":0.000000}]}

id	title	brand	score	text_features
795475224534450183	Шла собока по рояле	Ризеншнауцер	0	{"bm25":500, "bm25a":0.500000, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":1, "word_count":1, "tf_idf":0.000000, "min_idf":0.000000, "max_idf":0.000000, "sum_idf":0.000000, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.000000, "atc":0.000000}], "words":[{"tf":1, "idf":0.000000}]}
795475224534450184	Шлите апельсины	Марокко	0	{"bm25":500, "bm25a":0.500000, "field_mask":1, "doc_word_count":1, "fields":[{"field":0, "lcs":1, "hit_count":1, "word_count":1, "tf_idf":0.000000, "min_idf":0.000000, "max_idf":0.000000, "sum_idf":0.000000, "min_hit_pos":1, "min_best_span_pos":1, "exact_hit":0, "max_window_hits":1, "min_gaps":0, "exact_order":1, "lccs":1, "wlccs":0.000000, "atc":0.000000}], "words":[{"tf":1, "idf":0.000000}]}

(Take notice: all idf values in text_features are zero, bm25 values are significantly different (and erroneous) from baseline)

@alexiv1965
Copy link

alexiv1965 commented Sep 11, 2024

  1. Control:
show create table products;

CREATE TABLE products (
id bigint,
title text,
brand text
) index_exact_words='1' index_field_lengths='1' morphology='lemmatize_ru_all,lemmatize_en_all'

(Take notice: path to global.idf is not even stored in index, so, after restart of searchd we'll get absolutely the same bad results with zero idf.)

@tomatolog tomatolog removed the waiting Waiting for the original poster (in most cases) or something else label Sep 12, 2024
@tomatolog
Copy link
Contributor

I've just fixed global_idf issues at the RT index at 1611667

You need recreate your index to get issue fixed.

I also fixed indextool to work with dumpdict cli for the case you provided.

If you have any issues with usage of the global_idf or indextool please open new issues as this is very long and many cases are not completely related to the initial description

@sanikolaev sanikolaev added the rel::upcoming Upcoming release label Sep 16, 2024
@alexiv1965
Copy link

Many thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants