[optimize](parquet-reader) Skip whole row group in the parquet lazy read situation if data has been filtered out. #19039
clang-tidy review says "All clean, LGTM! 👍"
LGTM
run buildall
TeamCity pipeline, clickbench performance test result:
LGTM
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
…as been filtered. (apache#19039) We found that qt_q11 in the regression test test_external_catalog_hive is very slow. The result is only one record, so all other data should be filtered out in the parquet lazy read situation. We then found that the parquet reader currently reads many records because it can only skip parquet pages. To skip a parquet page, however, it currently has to read the page header, which causes data to be prefetched. Prefetching data in this case may therefore not be beneficial. So there are two issues: 1. Skip the whole row group in this case. 2. Prefetching data in this case may not be beneficial and needs to be improved. This PR resolves issue 1.
…ype in some cases. (#19348) Fix dict cols not being converted back to string type in some cases, including a case introduced by #19039. For dict cols, we first convert them to int32 type, then convert them back to string type after reading the block. The block will be reused, so it is necessary to convert it back.
Proposed changes
Problem summary
Close #19038
We found that qt_q11 in the regression test test_external_catalog_hive is very slow. The result is only one record, so all other data should be filtered out in the parquet lazy read situation.
We then found that the parquet reader currently reads many records because it can only skip parquet pages. To skip a parquet page, however, it currently has to read the page header, which causes data to be prefetched. Prefetching data in this case may therefore not be beneficial.
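So there are two issues:
1. Skip the whole row group in this case.
2. Prefetching data in this case may not be beneficial and needs to be improved.

This PR resolves issue 1.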
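The idea behind skipping a whole row group in the lazy read situation can be illustrated with a minimal Python sketch. This is not the Doris implementation: `RowGroup`, `ReadStats`, and `lazy_read` are hypothetical names, row groups are plain in-memory lists, and the sketch assumes the predicate column of a row group is fully evaluated before any lazy column is touched.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    """Hypothetical in-memory stand-in for one parquet row group."""
    predicate_col: List[int]  # column the filter is evaluated on
    lazy_col: List[str]       # column materialized only for surviving rows


@dataclass
class ReadStats:
    lazy_values_read: int = 0
    lazy_row_groups_skipped: int = 0


def lazy_read(row_groups, predicate, batch_size, stats):
    """Read predicate columns first; touch lazy columns only when rows survive."""
    results = []
    for rg in row_groups:
        n = len(rg.predicate_col)
        # Phase 1: scan the predicate column batch by batch and record survivors.
        selected = []
        for start in range(0, n, batch_size):
            end = min(start + batch_size, n)
            selected.extend(
                i for i in range(start, end) if predicate(rg.predicate_col[i])
            )
        # Phase 2: if every row was filtered out, skip the lazy column of this
        # row group entirely instead of walking it page by page.
        if not selected:
            stats.lazy_row_groups_skipped += 1
            continue
        for i in selected:
            stats.lazy_values_read += 1
            results.append((rg.predicate_col[i], rg.lazy_col[i]))
    return results


groups = [
    RowGroup([1, 2, 3, 4], ["a", "b", "c", "d"]),
    RowGroup([5, 6, 7, 8], ["e", "f", "g", "h"]),
    RowGroup([9, 10, 11, 12], ["i", "j", "k", "l"]),
]
stats = ReadStats()
rows = lazy_read(groups, lambda v: v == 7, batch_size=2, stats=stats)
print(rows, stats)
```

In this toy example only one row matches, so the lazy column is never touched for two of the three row groups. In a real reader the saving would come from not reading page headers (and not triggering prefetch) for those row groups.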
Test result:
Before opt:
After opt: