
Conversation

@mudit-97

Hi all,

We are planning to adopt Iceberg as the table format in our data platform. As part of that, we benchmarked Hudi and Iceberg against vanilla Parquet-based DataFrame reads in Spark, and we found that Iceberg does not push filters down to the Parquet layer; instead it relies on row-group-based filtering implemented in Iceberg itself.

However, the amount of data read is higher in that case. We tried pushing the filters down to the Parquet layer as well, and we saw benefits in some queries.

We would like the community's view on this PR: does this approach make sense, or should it be avoided so that Iceberg relies only on row-group filtering?
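
To make the distinction concrete, here is a minimal standalone sketch (not code from this PR) of the two filtering levels as exposed by parquet-mr; the file path and the "id" column are invented for illustration:

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;

public class RecordFilterSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical predicate on a hypothetical "id" column.
    FilterPredicate predicate = FilterApi.eq(FilterApi.longColumn("id"), 42L);

    try (ParquetReader<Group> reader =
        ParquetReader.builder(new GroupReadSupport(), new Path("/tmp/data.parquet"))
            .withFilter(FilterCompat.get(predicate))
            .useStatsFilter(true)      // prune whole row groups via min/max statistics
            .useDictionaryFilter(true) // prune row groups via dictionary pages
            .useRecordFilter(true)     // also drop non-matching records during assembly
            .build()) {
      for (Group record = reader.read(); record != null; record = reader.read()) {
        // With record filtering enabled, only rows matching the predicate reach this point.
      }
    }
  }
}

Row-group pruning can only skip whole row groups whose statistics rule out the predicate; record-level filtering additionally drops non-matching rows inside the row groups that are still read.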

@mudit-97
Author

@Fokko @nastra can you please take a look or tag the relevant folks if possible?

@gvramana

@ajantha-bhat can you check and help with this?

@ajantha-bhat
Member

ajantha-bhat commented Jan 18, 2024

Can you run Spotless (./gradlew spotlessApply) and also fix the test case failure?

I haven't worked closely on this module of Iceberg, but I will take a look and also tag others for review.
If there is no feedback on the PR this week, please open a mailing list discussion on the same topic.

@mudit-97
Author

@ajantha-bhat, ack, will fix the UTs and comment here.

@mudit-97
Author

@ajantha-bhat fixed the test cases and applied Spotless.

@ajantha-bhat
Member

I think there are still a Checkstyle failure, a revAPI failure (since public method parameters changed), and a test case failure.

You can also run ./gradlew clean build -x test -x integrationTest locally to get a clean build before pushing.

@ajantha-bhat
Member

I spent a little time and understood this PR.

Basically, you want to enable record-level filtering for the vectorized reader (and you have observed benefits with this POC PR) instead of only high-level row-group filtering.

You want to know why it is not enabled for the vectorized reader but is enabled for the plain ParquetReadBuilder without reader functions (readerFunc and batchedReaderFunc).

I don't have an answer for this history; I can check on the Iceberg Slack or mailing list.
We might have to run all the benchmarks and understand whether there are any side effects of enabling record-level filtering.
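
For reference, a rough sketch (not this PR's diff) of the row-based path where record-level filtering is already available today; the predicate and the generic reader usage are only illustrative:

import org.apache.iceberg.Schema;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.data.parquet.GenericParquetReaders;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.parquet.Parquet;

class RowBasedReadSketch {
  // Row-based path: the pushed-down expression prunes row groups and, with
  // filterRecords(true), also filters individual records during the read.
  static CloseableIterable<Record> read(InputFile file, Schema schema) {
    return Parquet.read(file)
        .project(schema)
        .filter(Expressions.equal("id", 42L)) // hypothetical predicate
        .filterRecords(true)
        .createReaderFunc(fileType -> GenericParquetReaders.buildReader(schema, fileType))
        .build();
  }
  // The vectorized path (createBatchedReaderFunc) is the one discussed above,
  // where today only row-group-level pruning is applied.
}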

import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetOutputFormat;
import org.apache.parquet.hadoop.*;
@ajantha-bhat
Member


If you run the compile check you will get a Checkstyle error for this. Iceberg doesn't use wildcard (*) imports.

@mudit-97
Author


@ajantha-bhat yes, I have fixed this locally. I am fixing one integration test case and then will push again.

boolean pushedFilters = true;
ParquetReadOptions options = optionsBuilder.build();

if (filter != null) {
@ajantha-bhat
Member


This is code duplication of the block below. We can extract the common logic used to form the Parquet filters.
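
Something along these lines, purely as an illustration (the helper name and exact signature are invented, not the actual PR code):

// Build the ParquetReadOptions once, applying the record filter when one is
// pushed down, so both read paths share the same logic.
private static ParquetReadOptions buildOptions(
    ParquetReadOptions.Builder optionsBuilder, FilterCompat.Filter recordFilter) {
  if (recordFilter != null) {
    optionsBuilder.withRecordFilter(recordFilter);
  }
  return optionsBuilder.build();
}

Both the plain and the vectorized builder could then call the same helper instead of repeating the wiring.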

@mudit-97
Author


ack

@mudit-97
Author

I spent a little time and understood this PR.

Basically, you want to enable record-level filtering for the vectorized reader (and you have observed benefits with this POC PR) instead of only high-level row-group filtering.

You want to know why it is not enabled for the vectorized reader but is enabled for the plain ParquetReadBuilder without reader functions (readerFunc and batchedReaderFunc).

I don't have an answer for this history; I can check on the Iceberg Slack or mailing list. We might have to run all the benchmarks and understand whether there are any side effects of enabling record-level filtering.

Yes @ajantha-bhat, we were running TPC-DS benchmarks, and we saw that in some places this caused degradation while in others it improved the results. That's why we wanted to understand whether we are going in the right direction with this or not.

@ajantha-bhat
Member

@mudit-97: I asked the question on the Iceberg Slack (dev channel); feel free to join the channel.
@amogh-jahagirdar pointed me to the historical PR on the same topic (#1566).

I have yet to go through it and summarise it.

@mudit-97
Author

@ajantha-bhat thanks for the help; I will also join the channel.

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

github-actions bot added the stale label on Oct 13, 2024
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
