Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect results when using NOT IN and IN operators in a query #1123

Closed
hrulik opened this issue May 13, 2023 · 13 comments
Closed

Incorrect results when using NOT IN and IN operators in a query #1123

hrulik opened this issue May 13, 2023 · 13 comments
Labels

Comments

@hrulik
Copy link

hrulik commented May 13, 2023

I am experiencing unexpected results when running a query with NOT IN and IN operators on Manticore Search. The same query works correctly on Sphinx 3. Here is the query I am running:

SELECT id FROM index, delta WHERE id NOT IN (142529) and tag_ids IN (6719,97781,51017,981,25625) GROUP BY id ORDER BY total_ctr DESC LIMIT 0, 21;

The expected result should not include the id 142529, however, the query returns the following result:


| id |

| 142529 |
| 410074 |
| 410077 |
| 410084 |
| 410085 |
| 410097 |
| 410111 |

I have tried several alternatives, such as using id <> 142529, but the results are still incorrect. I would like to understand the reason for this behavior and any possible fixes or workarounds.

Thanks in advance for your help.

  • Manticore Search version: Manticore 6.0.4 1a3a4ea@230314 (columnar 2.0.4 5a49bd7@230306) (secondary 2.0.4 5a49bd7@230306)
  • OS version: Centos 8 / Linux 4.18.0-305.10.2.el8_4.x86_64 #1 SMP Tue Jul 20 17:25:16 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

No any error in logs

@tomatolog
Copy link
Contributor

could you issue the query to single index and provide one that produces the bad result?

@hrulik
Copy link
Author

hrulik commented May 13, 2023

could you issue the query to single index and provide one that produces the bad result?

mysql> SELECT id FROM index WHERE id NOT IN (142529) and ANY(tag_ids) IN (6719,97781,51017,981,25625) ORDER BY total_ctr DESC LIMIT 0, 21;
+--------+
| id |
+--------+
| 142529 |
+--------+
1 row in set (0.00 sec)

mysql> SELECT id FROM delta WHERE id NOT IN (142529) and ANY(tag_ids) IN (6719,97781,51017,981,25625) ORDER BY total_ctr DESC LIMIT 0, 21;
+--------+
| id |
+--------+
| 410074 |
| 410077 |
| 410084 |
| 410085 |
| 410097 |
| 410111 |
| 410127 |
| 410134 |
| 410135 |
| 410139 |
| 410141 |
| 410147 |
| 410150 |
| 410193 |
| 410213 |
| 410221 |
| 410233 |
| 410254 |
| 410260 |
+--------+
19 rows in set (0.00 sec)

@tomatolog
Copy link
Contributor

could you upload your index to our s3 as described at our manual https://manual.manticoresearch.com Uploading your data topic ?

I will check issue locally here with data you provided.

@hrulik
Copy link
Author

hrulik commented May 14, 2023

Thank you for your prompt response to my issue. I have uploaded the index and config files to help you better understand and reproduce the problem. Let me know if you also need source

@Korkman
Copy link

Korkman commented May 26, 2023

Hit the same bug. I have a medium size dataset (~ 55k docs, total size of all data files ~145 MiB) and selecting with id NOT IN () is entirely bugged. This seems to affect only the doc id column. Other columns tested: bigint, uint and JSON attributes with similar amounts of unique values are not affected.

manticore versions not affected:

  • 4.0.2
  • 4.2.0
  • 5.0.0
  • 5.0.2

manticore versions affected:

  • 6.0.0
  • 6.0.2
  • 6.0.4 without secondary indexes (columnar library uninstalled)
  • 6.0.4 with secondary indexes (columnar library installed, enforced with /*+ SecondaryIndex(id) */)

Happens with both RT and non-RT indices:

select count(*) from items;
+----------+
| count(*) |
+----------+
|    55186 |
+----------+
select count(*) from items WHERE id NOT IN (123456);
+----------+
| count(*) |
+----------+
|        0 |
+----------+

Various attempts to raise memory limits did not change the outcome:

select count(*) from items WHERE id NOT IN (123456) OPTION max_matches=1000000, max_matches_increase_threshold=1000000, accurate_aggregation=1;

Removing aggregation returns empty result set:

select id from items WHERE id NOT IN (123456) LIMIT 10 OPTION max_matches=1000000, max_matches=1000000, max_matches_increase_threshold=1000000;

DELETE queries are similarly affected, both ways deleting too few and too many documents.

Edit: removed non-relevant SQL query part

@sanikolaev
Copy link
Collaborator

I can reproduce the issue in 6.0.4 with secondary indexes ON:

snikolaev@dev2:~/issue-1123$ mysql -P9315 -h0
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 6.0.4 1a3a4ea82@230314 (secondary 2.0.4 5a49bd7@230306) git branch HEAD (no branch)

...

mysql> SELECT id FROM idx WHERE id NOT IN (142529) and ANY(tag_ids) IN (6719,97781,51017,981,25625) ORDER BY total_ctr DESC LIMIT 0, 21;
+--------+
| id     |
+--------+
| 142529 |
+--------+
1 row in set (0.00 sec)

mysql> set global secondary_indexes=0;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT id FROM idx WHERE id NOT IN (142529) and ANY(tag_ids) IN (6719,97781,51017,981,25625) ORDER BY total_ctr DESC LIMIT 0, 21;
+--------+
| id     |
+--------+
| 127718 |
| 352740 |
| 190169 |
| 152047 |
|  74303 |
|  27001 |
| 223883 |
|  72821 |
| 108069 |
|  21919 |
|  77188 |
|  13975 |
|  35102 |
| 258368 |
| 190551 |
|  63791 |
| 162306 |
|  64851 |
|  75328 |
|  67280 |
| 140792 |
+--------+
21 rows in set (0.01 sec)

but can't reproduce it in the latest dev version:

snikolaev@dev2:~/issue-1123$ mysql -P9315 -h0
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 6.0.5 f77ce0e65@230524 dev (columnar 2.0.5 24e76dd@230422) (secondary 2.0.5 24e76dd@230422) git branch HEAD (no branch)

mysql> show variables like 'second%';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| secondary_indexes | 1     |
+-------------------+-------+
1 row in set (0.00 sec)

mysql> SELECT id FROM idx WHERE id NOT IN (142529) and ANY(tag_ids) IN (6719,97781,51017,981,25625) ORDER BY total_ctr DESC LIMIT 0, 21;
+--------+
| id     |
+--------+
| 127718 |
| 352740 |
| 190169 |
| 152047 |
|  74303 |
|  27001 |
| 223883 |
|  72821 |
| 108069 |
|  21919 |
|  77188 |
|  13975 |
|  35102 |
| 258368 |
| 190551 |
|  63791 |
| 162306 |
|  64851 |
|  75328 |
|  67280 |
| 140792 |
+--------+
21 rows in set (0.01 sec)

@Korkman @hrulik please check if the issue persists for you in the latest dev version - https://mnt.cr/nightly or https://hub.docker.com/layers/manticoresearch/manticore/dev/images/sha256-22694aef8296fa4a82dad933706c300e9afc1f8d2a3a1f42309a9eb0871ecfe2?context=explore

@sanikolaev sanikolaev added the waiting Waiting for the original poster (in most cases) or something else label May 26, 2023
@githubmanticore
Copy link
Contributor

➤ Ilya Kuznetsov commented:

Turning secondary indexes on or off is just a hint for the query optimizer. It might decide to not use them even if they are available. Please provide show meta output so we can see which code path is really used.

@githubmanticore
Copy link
Contributor

➤ Ilya Kuznetsov commented:

There are 4 different code paths to process docid filters, so it would be nice to see which one does not work as expected.

@Korkman
Copy link

Korkman commented May 26, 2023

Nightly fixes the problem! Still testing but looks good.

@Korkman
Copy link

Korkman commented May 26, 2023

6.0.4:

searchd -v
Manticore 6.0.4 1a3a4ea82@230314 (columnar 2.0.4 5a49bd7@230306) (secondary 2.0.4 5a49bd7@230306)

SphinxQL> select count(*) from items WHERE id > 0 and id NOT IN (123456);
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0,003 sec)

SphinxQL> show meta;
+----------------+--------------------------------------------+
| Variable_name  | Value                                      |
+----------------+--------------------------------------------+
| total          | 0                                          |
| total_found    | 0                                          |
| total_relation | eq                                         |
| time           | 0.003                                      |
| index          | id:DocidIndex (100%), id:DocidIndex (100%) |
+----------------+--------------------------------------------+

dev: 6.0.5-230526-f5cd92166:

searchd -v
Manticore 6.0.5 f5cd92166@230526 dev (columnar 2.0.5 24e76dd@230422) (secondary 2.0.5 24e76dd@230422)

SphinxQL> select count(*) from items WHERE id > 0 and id NOT IN (123457);
+----------+
| count(*) |
+----------+
|    55186 |
+----------+
1 row in set (0,007 sec)

SphinxQL> show meta;
+----------------+-------+
| Variable_name  | Value |
+----------------+-------+
| total          | 1     |
| total_found    | 1     |
| total_relation | eq    |
| time           | 0.006 |
+----------------+-------+
4 rows in set (0,001 sec)

Edit: added searchd -v

@githubmanticore
Copy link
Contributor

➤ Ilya Kuznetsov commented:

So the issue is that 6.0.4 uses DocidIndex and returns incorrect results while 6.0.5 decides not to use DocidIndex, returns correct results but query becomes 2x slower. Adding /*+ DocidIndex(id) */ will likely force 6.0.5 to return incorrect results.

@Korkman
Copy link

Korkman commented May 26, 2023

Thought so, too, but I couldn't enforce index usage. Only when using IN instead of NOT IN it will use the index.

@githubmanticore
Copy link
Contributor

➤ Ilya Kuznetsov commented:

Latest dev version (3b62be2) has a lot more warnings about why trying to force a secondary index (or DocidIndex(id)) doesn't work. In this case an exclude filter (NOT IN) is currently not supported by secondary indexes and it is impossible to use them. So this bug is not present in 6.0.5.

@githubmanticore githubmanticore removed the waiting Waiting for the original poster (in most cases) or something else label May 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants