Use Parquet column index when reading table in Iceberg #12977

Closed
ebyhr wants to merge 3 commits into trinodb:master from ebyhr:ebi/iceberg-parquet-column-index

Member

ebyhr commented Jun 24, 2022

Description

Use Parquet column index when reading table in Iceberg
Fixes #11000

Documentation

( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

cla-bot (bot) added the cla-signed label Jun 24, 2022
ebyhr force-pushed the ebi/iceberg-parquet-column-index branch from a9e5876 to 666cc52 June 25, 2022 03:34
Member

raunaqmorarka left a comment

Are we impacted by any of the problems mentioned in apache/iceberg#193?
Note that our Parquet reader relies on the Parquet filter APIs to get the row ranges to be read from the column index (ColumnIndexFilter.calculateRowRanges).
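
To make the row-range idea concrete, here is a minimal sketch of how per-page min/max statistics can be turned into the row ranges that still need to be read. It is not the ColumnIndexFilter implementation: the `ColumnIndexSketch`, `PageStats`, and `RowRange` names and the simple range predicate are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: these types are hypothetical and are not Trino or parquet-mr classes.
final class ColumnIndexSketch
{
    // Per-page statistics, roughly what a column index exposes, plus the rows the page covers.
    record PageStats(long firstRow, long rowCount, long min, long max) {}

    // Half-open range of row positions [start, end) that still has to be read.
    record RowRange(long start, long end) {}

    // Keep only pages whose [min, max] overlaps the predicate "low <= value <= high",
    // merging adjacent surviving pages into a single range.
    static List<RowRange> rowRangesFor(List<PageStats> pages, long low, long high)
    {
        List<RowRange> ranges = new ArrayList<>();
        for (PageStats page : pages) {
            boolean mayMatch = page.max() >= low && page.min() <= high;
            if (!mayMatch) {
                continue; // no value in this page can match, so the page is skipped
            }
            long start = page.firstRow();
            long end = start + page.rowCount();
            int last = ranges.size() - 1;
            if (last >= 0 && ranges.get(last).end() == start) {
                // This page directly follows the previous surviving page: extend that range.
                ranges.set(last, new RowRange(ranges.get(last).start(), end));
            }
            else {
                ranges.add(new RowRange(start, end));
            }
        }
        return ranges;
    }

    public static void main(String[] args)
    {
        List<PageStats> pages = List.of(
                new PageStats(0, 100, 1, 50),
                new PageStats(100, 100, 51, 120),
                new PageStats(200, 100, 121, 300));
        // Only the middle page can contain values between 60 and 100.
        System.out.println(rowRangesFor(pages, 60, 100));
    }
}
```

The sketch only shows why readers depend on correct per-page statistics: a wrong min or max would cause matching rows to be skipped.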

Member

I wonder if we should wait to have native writer support for page indexes before we implement this.
Otherwise users don't have a straightforward way to make use of this functionality.

Member

Does the Hive connector write them? We could create / insert data into a table from Hive and then migrate it to an Iceberg table like we do here: https://github.com/trinodb/trino/blob/master/testing/trino-product-tests/src/main/java/io/trino/tests/product/iceberg/TestIcebergSparkCompatibility.java#L1305

Member

Yes, the current default Parquet writer in the Hive connector writes them, although that would stop once we change the default to the native Parquet writer (unless we also implement it there).
While that is a better way of testing, it would still be difficult for end users to benefit from this feature.

Member

It is not ideal, but testing the feature is a bit of a chicken/egg problem unless we put both in at the same time.

Member

If we add page index support to the native Parquet writer, we can test reads from it via the Hive and Delta Lake connectors, so we don't need to do both at the same time.

Member Author

I don't mind suspending this PR until the native Parquet writer supports page indexes.

Member

The version of predicateMatches without columnIndex in PredicateUtils should be removed now.

Member

Although we won't read the column index from the file until later, it is better to avoid creating columnIndex until it's needed, that is, after the start <= firstDataPage && firstDataPage < start + length check. We can make the same change in ParquetPageSourceFactory as well.
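
A rough sketch of that ordering, assuming a supplier-based factory: the split-range check runs first, and the column index object is only built for row groups that pass it. `LazyColumnIndexSketch`, `ColumnIndexStore`, and `columnIndexIfInSplit` are hypothetical names used for illustration and do not mirror the actual page source factory code.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Illustrative sketch only: names are hypothetical and do not mirror the actual Trino classes.
final class LazyColumnIndexSketch
{
    // Hypothetical stand-in for whatever object wraps a row group's column index.
    record ColumnIndexStore(long firstDataPage) {}

    // Build the column index store only when the row group's first data page
    // falls inside the split [start, start + length).
    static Optional<ColumnIndexStore> columnIndexIfInSplit(
            long start,
            long length,
            long firstDataPage,
            Supplier<ColumnIndexStore> factory)
    {
        boolean inSplit = start <= firstDataPage && firstDataPage < start + length;
        if (!inSplit) {
            return Optional.empty(); // the factory is never invoked for skipped row groups
        }
        return Optional.of(factory.get());
    }

    public static void main(String[] args)
    {
        long start = 0;
        long length = 1024;
        for (long firstDataPage : new long[] {4, 2048}) {
            Optional<ColumnIndexStore> store = columnIndexIfInSplit(
                    start, length, firstDataPage, () -> new ColumnIndexStore(firstDataPage));
            System.out.println(firstDataPage + " -> " + store.isPresent());
        }
    }
}
```

Passing a supplier keeps the check and the construction in one place, so row groups outside the split never pay for object creation.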

Member

Could you add a bit of rationale to this commit about how it's required for the page indexing feature?

Member

I am curious why we want this and how it relates.

Member Author

ebyhr Jun 28, 2022

The native Parquet writer doesn't support writing the column index, so we need to use the legacy writer. However, the legacy writer doesn't write field-id, so the connector can't read the fields correctly.

I think it would be better to suspend this PR until the native Parquet writer is improved.
Alternatively, we could revert this change and modify the table property using the Iceberg library within the test.

Member

Do we want to allow setting the row group size to >2GB?

Member

Rename TestParquetPageSkipping to TestHiveTestParquetPageSkipping

This is not a "rename class" change.
It is more of an extraction of a test base class.

Member

I am curious why we want this and how it relates.

ebyhr added 3 commits August 10, 2022 10:17
Additionally, extract a method to return table definition
to allow running in Iceberg connector.
ebyhr force-pushed the ebi/iceberg-parquet-column-index branch from 666cc52 to 3a5a15a August 10, 2022 03:04
Member Author

ebyhr commented Aug 16, 2022

Let's continue in #13584

ebyhr closed this Aug 16, 2022
ebyhr deleted the ebi/iceberg-parquet-column-index branch August 16, 2022 00:31
Development

Successfully merging this pull request may close these issues.

Skip reading Parquet pages using Column Indexes for Iceberg

4 participants