Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[optimize](parquet-reader) Skip whole row group in the parquet lazy read situation if data has been filtered out. #19039

Merged
merged 1 commit into from
Apr 26, 2023

Conversation

kaka11chen
Copy link
Contributor

@kaka11chen kaka11chen commented Apr 25, 2023

Proposed changes

Problem summary

Close #19038

We found qt_q11 in regression test test_external_catalog_hive is very slow.
The result is only one record, so other data should be filtered out in the parquet lazy read situation.
Then we found currently the parquet reader read many records because we can only skip parquet page. But in order to skip parquet page, currently we need to read page header, then it will caused prefetch data. Therefore, prefetch data in this case may be not good.

So there are two issues:

  1. Skip whole row group in this case.
  2. Prefetching data in this case may be not good, need to improve it.

This PR resolve issues 1.

Test result:

Before opt:

mysql> select l_quantity from test_external_catalog_hive.tpch_1000_parquet.lineitem where l_orderkey = 599614241 and l_partkey = 59018738 and l_suppkey = 1518744 limit 2;
+------------+
| l_quantity |
+------------+
|      16.00 |
+------------+
1 row in set (2 min 27.55 sec)

After opt:

mysql> select l_quantity from test_external_catalog_hive.tpch_1000_parquet.lineitem where l_orderkey = 599614241 and l_partkey = 59018738 and l_suppkey = 1518744 limit 2;
+------------+
| l_quantity |
+------------+
|      16.00 |
+------------+
1 row in set (41.95 sec)

Checklist(Required)

  • Does it affect the original behavior
  • Has unit tests been added
  • Has document been added or modified
  • Does it need to update dependencies
  • Is this PR support rollback (If NO, please explain WHY)

Further comments

If this is a relatively large or complex change, kick off the discussion at [email protected] by explaining why you chose the solution you did and what alternatives you considered, etc...

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@AshinGau
Copy link
Member

LGTM

@kaka11chen
Copy link
Contributor Author

run buildall

@hello-stephen
Copy link
Contributor

TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 33.78 seconds
stream load tsv: 424 seconds loaded 74807831229 Bytes, about 168 MB/s
stream load json: 24 seconds loaded 2358488459 Bytes, about 93 MB/s
stream load orc: 59 seconds loaded 1101869774 Bytes, about 17 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230425070905_clickbench_pr_134625.html

Copy link
Contributor

@morningman morningman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Apr 25, 2023
@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Contributor

PR approved by anyone and no changes requested.

@yiguolei yiguolei merged commit 5bd4a38 into apache:master Apr 26, 2023
gnehil pushed a commit to gnehil/doris that referenced this pull request Apr 27, 2023
…as been filtered. (apache#19039)

We found qt_q11 in regression test test_external_catalog_hive is very slow.
The result is only one record, so other data should be filtered out in the parquet lazy read situation.
Then we found currently the parquet reader read many records because we can only skip parquet page. But in order to skip parquet page, currently we need to read page header, then it will caused prefetch data. Therefore, prefetch data in this case may be not good.

So there are two issues:

Skip whole row group in this case.
Prefetching data in this case may be not good, need to improve it.
This PR resolve issues 1.
morningman pushed a commit that referenced this pull request May 7, 2023
…ype in some cases. (#19348)

Fix dict cols not be converted back to string type in some cases, which includes introduced by #19039.
For dict cols, we will convert dict cols to int32 type firstly, then convert back to string type after read block. 
The block will be reuse it, so it is necessary to convert it back.
Reminiscent pushed a commit to Reminiscent/doris that referenced this pull request May 15, 2023
…as been filtered. (apache#19039)

We found qt_q11 in regression test test_external_catalog_hive is very slow.
The result is only one record, so other data should be filtered out in the parquet lazy read situation.
Then we found currently the parquet reader read many records because we can only skip parquet page. But in order to skip parquet page, currently we need to read page header, then it will caused prefetch data. Therefore, prefetch data in this case may be not good.

So there are two issues:

Skip whole row group in this case.
Prefetching data in this case may be not good, need to improve it.
This PR resolve issues 1.
Reminiscent pushed a commit to Reminiscent/doris that referenced this pull request May 15, 2023
…ype in some cases. (apache#19348)

Fix dict cols not be converted back to string type in some cases, which includes introduced by apache#19039.
For dict cols, we will convert dict cols to int32 type firstly, then convert back to string type after read block. 
The block will be reuse it, so it is necessary to convert it back.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by one committer. area/vectorization reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Enhancement] qt_q11 in regression test test_external_catalog_hive is very slow.
5 participants