Pushed filters to Parquet file on best effort basis in Vectorized Reader #9479
Conversation
@ajantha-bhat, can you check and help with this?
Can we run spotless? I haven't worked closely with this module of Iceberg, but I will take a look. I will also tag others for review.
@ajantha-bhat, ack. I will fix the UTs and comment here.
@ajantha-bhat, I fixed the test cases and applied spotless.
I think there are still a checkstyle failure, a revAPI failure (since public method parameters changed), and a test case failure. You can also run the checks locally.
ajantha-bhat left a comment
I spent a little time and understood this PR.
Basically, you want to enable record-level filtering for the vectorized reader (and you have observed the benefits with this POC PR) instead of only high-level row-group filtering.
You want to know why it is not enabled for the vectorized reader but is enabled for the plain ParquetReadBuilder without reader functions (readerFunc and batchedReaderFunc).
I don't have an answer for this history; I can check on the Iceberg Slack or mailing list.
We might have to run all the benchmarks to understand whether enabling record-level filtering has any side effects.
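To make the distinction concrete, here is a minimal sketch of the two filtering levels using parquet-mr's public `ParquetReadOptions` API. This is illustrative only, not code from this PR; the class and column names (`FilterPushdownSketch`, `id`) are assumptions:

```java
import org.apache.parquet.ParquetReadOptions;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;

public class FilterPushdownSketch {
  public static ParquetReadOptions recordLevelOptions() {
    // Predicate: id == 5. With record filtering enabled, parquet-mr evaluates
    // the predicate per record; with only stats filtering, whole row groups
    // are pruned via min/max statistics but surviving groups are read fully.
    FilterPredicate predicate = FilterApi.eq(FilterApi.intColumn("id"), 5);
    return ParquetReadOptions.builder()
        .withRecordFilter(FilterCompat.get(predicate))
        .useRecordFilter(true) // record-level filtering (what this PR enables)
        .useStatsFilter(true)  // row-group pruning via statistics (existing path)
        .build();
  }
}
```

Row-group pruning alone can skip a group only when the predicate falls entirely outside its min/max range; record-level filtering additionally drops non-matching rows inside the groups that are read, which is where the benchmark differences would show up.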
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetOutputFormat;
import org.apache.parquet.hadoop.*;
If you run the compile check, you will get a checkstyle error for this; Iceberg doesn't use wildcard (*) imports.
@ajantha-bhat, yes, I have fixed this locally. I am fixing one integration test case and will then push again.
boolean pushedFilters = true;
ParquetReadOptions options = optionsBuilder.build();

if (filter != null) {
This duplicates code from the block below. We can extract the common logic into ParquetFilters.
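One way to remove the duplication is a small helper that applies the record filter to the options builder only when a filter is present. This is a hedged sketch; `ReadOptionsFilters` and `applyFilter` are hypothetical names, not the actual ParquetFilters refactor:

```java
import org.apache.parquet.ParquetReadOptions;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterPredicate;

final class ReadOptionsFilters {
  private ReadOptionsFilters() {}

  /**
   * Applies the record filter to the builder only when one is present,
   * so both call sites (plain and vectorized read paths) can share it.
   */
  static ParquetReadOptions.Builder applyFilter(
      ParquetReadOptions.Builder builder, FilterPredicate filter) {
    if (filter != null) {
      builder.withRecordFilter(FilterCompat.get(filter)).useRecordFilter(true);
    }
    return builder;
  }
}
```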
ack
Yes @ajantha-bhat, we were running TPC-DS benchmarks, and we saw that in some places this was causing degradation while in others it was improving the results. That's why we wanted to understand whether we are going in the right direction with this or not.
@mudit-97: I did ask the question on the Iceberg Slack (dev channel); feel free to join. I still need to go through the responses and summarise them.
@ajantha-bhat, thanks for the help. I will join the channel too.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Hi all,
We are planning to onboard Iceberg as the table-format stack in our data platform. For that purpose, we benchmarked Hudi and Iceberg against vanilla Parquet-based DataFrame reads in Spark, and we found that Iceberg does not push filters down to the Parquet layer; it relies on row-group-based filtering, which is a pattern customized for Iceberg.
However, the amount of data read is higher in that case. We tried pushing the filters down to the Parquet layer as well and saw benefits in some queries.
We would like the community's comments on this PR: does this make sense, or should it be avoided so that Iceberg relies only on row-group filtering?