Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange count distinct/group by on json field behavior #369

Closed
mvollsen opened this issue Jul 6, 2020 · 4 comments
Closed

Strange count distinct/group by on json field behavior #369

mvollsen opened this issue Jul 6, 2020 · 4 comments
Labels

Comments

@mvollsen
Copy link

mvollsen commented Jul 6, 2020

Describe the environment

Manticore Search version (top line in output of bin/searchd -v or bin/indexer -v):

Manticore 3.4.3 2e1484c@200701 release

OS version (uname -a if on a Unix-like system):

Linux 3.10.0-1127.10.1.el7.x86_64 #1 SMP Wed Jun 3 14:28:03 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Describe the problem:

The count distinct/group by result on a JSON/Data field returns wrong results

Steps to reproduce:

Your MySQL connection id is 1    
Server version: 5.0.37    
    
Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.    
    
Oracle is a registered trademark of Oracle Corporation and/or its    
affiliates. Other names may be trademarks of their respective    
owners.    
    
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.    
    
mysql> REPLACE INTO statistic1(id,identitytype,data) VALUES (3817,17,'{"id":38,"clusters":[2,3,5],"isStudyOfTheWeek":false,"searchEntity":"Report","title":"Report 1"}');    
Query OK, 1 row affected (0,00 sec)    
    
mysql> REPLACE INTO statistic1(id,identitytype,data) VALUES (3617,17,'{"id":36,"clusters":[2,3,5],"isStudyOfTheWeek":false,"searchEntity":"Report","title":"Report 2"}');    
Query OK, 1 row affected (0,00 sec)    
    
mysql> select * from statistic1;    
*************************** 1. row ***************************    
                id: 3617    
      identitytype: 17    
              data: {"id":36,"clusters":[2,3,5],"isStudyOfTheWeek":false,"searchEntity":"Report","title":"Report 2"}    
*************************** 2. row ***************************    
                id: 3817    
      identitytype: 17    
              data: {"id":38,"clusters":[2,3,5],"isStudyOfTheWeek":false,"searchEntity":"Report","title":"Report 1"}    
2 rows in set (0,00 sec)    

We're usually expect the distinct count return 2, but it returns 1

mysql> SELECT identitytype as identity, data.id cntent, COUNT(DISTINCT cntent) entitycounter FROM statistic1;    
*************************** 1. row ***************************    
     identity: 17    
       cntent: 36    
entitycounter: 1    
1 row in set (0,00 sec)    
@mvollsen
Copy link
Author

mvollsen commented Jul 6, 2020

In addition, the group by behaves differently on Sphinx3/Manticore compared to Sphinx2 (We are not able to reproduce it on a small basis appropriate for this scope and it is blocked by this issue)

Please give us a note, if you need a separate issue for this.

Sphinx2:

mysql> SELECT identitytype as identity, data.id cntent, COUNT(DISTINCT cntent) entitycounter FROM study WHERE identity IN (17);
*************************** 1. row ***************************
     identity: 17
       cntent: 17
entitycounter: 380
1 row in set (0,00 sec)
mysql> SELECT identitytype as identity, data.id cntent, COUNT(DISTINCT cntent) entitycounter FROM study WHERE identity IN (17) GROUP BY identity;
*************************** 1. row ***************************
     identity: 17
       cntent: 17
entitycounter: 380
1 row in set (0,01 sec)
mysql> exit
Bye
$ searchd --version
Sphinx 2.2.11-id64-release (95ae9a6)

Sphinx3

mysql> SELECT identitytype as identity, data.id cntent, COUNT(DISTINCT cntent) entitycounter FROM study WHERE identity IN (17);
*************************** 1. row ***************************
     identity: 17
       cntent: 25910
entitycounter: 400
1 row in set (0,12 sec)
mysql> SELECT identitytype as identity, data.id cntent, COUNT(DISTINCT cntent) entitycounter FROM study WHERE identity IN (17) GROUP BY identity;
*************************** 1. row ***************************
     identity: 17
       cntent: 17
entitycounter: 0
1 row in set (0,13 sec)
mysql> exit
Bye
$searchd --version
Manticore 3.4.3 2e1484c@200701 release

@mvollsen
Copy link
Author

mvollsen commented Jul 7, 2020

mysql> SHOW INDEX statistic1 STATUS;
*************************** 1. row ***************************
Variable_name: index_type
        Value: rt
*************************** 2. row ***************************
Variable_name: indexed_documents
        Value: 2
*************************** 3. row ***************************
Variable_name: indexed_bytes
        Value: 0
*************************** 4. row ***************************
Variable_name: ram_bytes
        Value: 7304
*************************** 5. row ***************************
Variable_name: disk_bytes
        Value: 2815
*************************** 6. row ***************************
Variable_name: ram_chunk
        Value: 1424
*************************** 7. row ***************************
Variable_name: ram_chunk_segments_count
        Value: 2
*************************** 8. row ***************************
Variable_name: disk_chunks
        Value: 0
*************************** 9. row ***************************
Variable_name: mem_limit
        Value: 536870912
*************************** 10. row ***************************
Variable_name: ram_bytes_retired
        Value: 0
*************************** 11. row ***************************
Variable_name: tid
        Value: 11
*************************** 12. row ***************************
Variable_name: tid_saved
        Value: 11
*************************** 13. row ***************************
Variable_name: query_time_1min
        Value: {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}
*************************** 14. row ***************************
Variable_name: query_time_5min
        Value: {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}
*************************** 15. row ***************************
Variable_name: query_time_15min
        Value: {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}
*************************** 16. row ***************************
Variable_name: query_time_total
        Value: {"queries":26, "avg_sec":0.000, "min_sec":0.000, "max_sec":0.001, "pct95_sec":0.000, "pct99_sec":0.001}
*************************** 17. row ***************************
Variable_name: found_rows_1min
        Value: {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}
*************************** 18. row ***************************
Variable_name: found_rows_5min
        Value: {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}
*************************** 19. row ***************************
Variable_name: found_rows_15min
        Value: {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}
*************************** 20. row ***************************
Variable_name: found_rows_total
        Value: {"queries":26, "avg":1, "min":0, "max":2, "pct95":2, "pct99":2}
20 rows in set (0,00 sec)

@sanikolaev
Copy link
Collaborator

sanikolaev commented Jul 7, 2020

Simpler reproducible case for the wrong COUNT(DISTINCT).

create table t(f text, j json);
insert into t(id,j) values(1,'{"id":6}');
insert into t(id,j) values(2,'{"id":7}');
select j.id i, count(distinct i) from t;
+------+-------------------+
| i    | count(distinct i) |
+------+-------------------+
| 7    |                 1 |
+------+-------------------+
1 row in set (0.00 sec)

FLUSH RAMCHUNK solves the issue:

mysql> FLUSH RAMCHUNK t;
Query OK, 0 rows affected (0.05 sec)

mysql> select j.id i, count(distinct i) from t;
+------+-------------------+
| i    | count(distinct i) |
+------+-------------------+
| 7    |                 2 |
+------+-------------------+
1 row in set (0.00 sec)

If I insert the both documents in the same batch it's also fine.

@githubmanticore
Copy link
Contributor

➤ Ilya Kuznetsov commented:

Fixed count(distinct) on json attributes in 9bc5c01

@glookka glookka closed this as completed Jul 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants