[optimize](parquet-reader) Skip whole row group in the parquet lazy read situation if data has been filtered out. #19039

kaka11chen · 2023-04-25T05:34:31Z

Proposed changes

Problem summary

Close #19038

We found that qt_q11 in the regression test test_external_catalog_hive is very slow.
The query returns only one record, so almost all other data should be filtered out during Parquet lazy read.
However, the Parquet reader still reads many records, because it can currently skip data only at page granularity. To skip a page, the reader must first read the page header, and that read triggers data prefetching. Prefetching is therefore likely to be wasteful in this case.

So there are two issues:

1. The whole row group should be skipped in this case.
2. Prefetching data in this case may be wasteful and needs to be improved.

This PR resolves issue 1.
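The idea behind issue 1 can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the code in vparquet_group_reader.cpp: `LazyReadDecision` and `scan_predicate_batches` are hypothetical names, and the per-batch filter results are modeled as plain `std::vector<bool>` values. The reader evaluates the predicate columns batch by batch and only accumulates a count of filtered rows. If the end of the row group is reached without any surviving row, the lazy columns never need to be opened, so no page header is read for them.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical result of scanning the predicate columns of one row group.
struct LazyReadDecision {
    bool skip_row_group = false;        // true: lazy columns are never touched
    std::size_t pending_skip_rows = 0;  // filtered rows seen before the first hit
    std::size_t first_hit_batch = 0;    // index of the first batch with a kept row
};

// keep_flags[i][j] is true when row j of batch i passes the predicate.
// Assumption: the batches together cover exactly one row group.
LazyReadDecision scan_predicate_batches(
        const std::vector<std::vector<bool>>& keep_flags) {
    LazyReadDecision decision;
    for (std::size_t i = 0; i < keep_flags.size(); ++i) {
        const std::vector<bool>& batch = keep_flags[i];
        const bool any_kept = std::any_of(batch.begin(), batch.end(),
                                          [](bool keep) { return keep; });
        if (any_kept) {
            // Lazy columns must skip pending_skip_rows rows, then read this batch.
            decision.first_hit_batch = i;
            return decision;
        }
        // Fully filtered batch: remember the row count, do not touch lazy columns.
        decision.pending_skip_rows += batch.size();
    }
    // Every batch was fully filtered: the whole row group can be skipped.
    decision.skip_row_group = true;
    return decision;
}
```

With two fully filtered batches of sizes 2 and 1, the sketch reports `skip_row_group == true` and 3 pending rows. If the second batch contains a kept row, it reports `first_hit_batch == 1` with 2 rows to skip first.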

Test result:

Before optimization:

mysql> select l_quantity from test_external_catalog_hive.tpch_1000_parquet.lineitem where l_orderkey = 599614241 and l_partkey = 59018738 and l_suppkey = 1518744 limit 2;
+------------+
| l_quantity |
+------------+
|      16.00 |
+------------+
1 row in set (2 min 27.55 sec)

After optimization:

mysql> select l_quantity from test_external_catalog_hive.tpch_1000_parquet.lineitem where l_orderkey = 599614241 and l_partkey = 59018738 and l_suppkey = 1518744 limit 2;
+------------+
| l_quantity |
+------------+
|      16.00 |
+------------+
1 row in set (41.95 sec)

Checklist(Required)

Does it affect the original behavior
Have unit tests been added
Has the documentation been added or modified
Does it need to update dependencies
Does this PR support rollback (If NO, please explain WHY)

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

github-actions · 2023-04-25T05:38:16Z

clang-tidy review says "All clean, LGTM! 👍"

be/src/vec/exec/format/parquet/vparquet_group_reader.cpp

AshinGau · 2023-04-25T05:55:05Z

LGTM

kaka11chen · 2023-04-25T06:33:44Z

run buildall

hello-stephen · 2023-04-25T07:09:08Z

TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 33.78 seconds
stream load tsv: 424 seconds loaded 74807831229 Bytes, about 168 MB/s
stream load json: 24 seconds loaded 2358488459 Bytes, about 93 MB/s
stream load orc: 59 seconds loaded 1101869774 Bytes, about 17 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230425070905_clickbench_pr_134625.html

morningman

LGTM

github-actions · 2023-04-25T07:12:29Z

PR approved by at least one committer and no changes requested.

github-actions · 2023-04-25T07:12:32Z

PR approved by anyone and no changes requested.

…ype in some cases. (#19348) Fix dict cols not being converted back to string type in some cases, including a case introduced by #19039. For dict cols, we first convert them to int32 type, then convert them back to string type after reading the block. The block is reused, so it is necessary to convert it back.
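The round trip described in that fix can be illustrated with a small sketch. It is not the Doris implementation: `DictColumn`, `convert_back_to_string`, and `read_batch` are hypothetical names, and the dictionary is modeled as a plain vector of strings. The point is that the reusable column slot must be restored to string type on every path, including when all rows are filtered out, otherwise the next read would start from an int32 column.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical reusable column slot that is temporarily switched to dict codes.
struct DictColumn {
    bool is_dict_code = false;         // true while the slot holds int32 codes
    std::vector<std::int32_t> codes;   // valid when is_dict_code is true
    std::vector<std::string> strings;  // valid when is_dict_code is false
};

// Decode the codes through the dictionary and restore the string type.
// Does nothing if the column already holds strings.
void convert_back_to_string(DictColumn& col, const std::vector<std::string>& dict) {
    if (!col.is_dict_code) {
        return;
    }
    col.strings.clear();
    for (std::int32_t code : col.codes) {
        col.strings.push_back(dict.at(static_cast<std::size_t>(code)));
    }
    col.codes.clear();
    col.is_dict_code = false;
}

// Hypothetical read step: filter on the int32 codes, then always convert back.
// Returns the number of surviving rows.
std::size_t read_batch(DictColumn& slot, const std::vector<std::string>& dict,
                       const std::vector<std::int32_t>& raw_codes,
                       std::int32_t wanted_code) {
    slot.is_dict_code = true;  // the slot now holds int32 codes
    slot.strings.clear();
    slot.codes.clear();
    for (std::int32_t code : raw_codes) {
        if (code == wanted_code) {
            slot.codes.push_back(code);
        }
    }
    // Convert back even when nothing survived, because the slot is reused.
    convert_back_to_string(slot, dict);
    return slot.strings.size();
}
```

In this sketch a batch in which no code matches still leaves `slot.is_dict_code == false`, so a later reader that expects a string column is not handed int32 codes.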

[optimize](multi-catalog) Skip whole row group in lazy_read if data has been filtered.

7d47a84

github-actions bot added the area/vectorization label Apr 25, 2023

AshinGau reviewed Apr 25, 2023

be/src/vec/exec/format/parquet/vparquet_group_reader.cpp

morningman approved these changes Apr 25, 2023

github-actions bot added the approved Indicates a PR has been approved by one committer. label Apr 25, 2023

github-actions bot added the reviewed label Apr 25, 2023

yiguolei merged commit 5bd4a38 into apache:master Apr 26, 2023

kaka11chen mentioned this pull request May 6, 2023

[Fix](parquet-reader) Fix dict cols not be converted back to string type in some cases. #19348

Merged

5 tasks
