Use Parquet column index when reading table in Iceberg by ebyhr · Pull Request #12977 · trinodb/trino

ebyhr · 2022-06-24T12:17:39Z

Description

Use Parquet column index when reading table in Iceberg
Fixes #11000

Documentation

( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

raunaqmorarka

Are we impacted by any of the problems mentioned in apache/iceberg#193?
Note that we rely on the Parquet filter APIs in our Parquet reader to get the row ranges to read from the column index (ColumnIndexFilter.calculateRowRanges).
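
For readers unfamiliar with the idea, here is a minimal sketch of how per-page min/max statistics can be turned into row ranges. It is an illustration only, not the parquet-mr implementation behind ColumnIndexFilter.calculateRowRanges; the `PageSkippingSketch` and `PageStats` names and the single closed-interval predicate are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class PageSkippingSketch {
    // Hypothetical per-page statistics, loosely modeled on what a column index stores.
    static final class PageStats {
        final long firstRow;  // index of the first row in the page
        final long rowCount;  // number of rows in the page
        final long min;       // minimum value in the page
        final long max;       // maximum value in the page

        PageStats(long firstRow, long rowCount, long min, long max) {
            this.firstRow = firstRow;
            this.rowCount = rowCount;
            this.min = min;
            this.max = max;
        }
    }

    // Returns [startRow, endRowExclusive) ranges for pages whose [min, max]
    // overlaps the predicate range [lower, upper]. Adjacent ranges are merged.
    static List<long[]> rowRanges(List<PageStats> pages, long lower, long upper) {
        List<long[]> ranges = new ArrayList<>();
        for (PageStats page : pages) {
            if (page.max < lower || page.min > upper) {
                continue; // no value in this page can match, so skip it
            }
            long from = page.firstRow;
            long to = page.firstRow + page.rowCount;
            if (!ranges.isEmpty() && ranges.get(ranges.size() - 1)[1] == from) {
                ranges.get(ranges.size() - 1)[1] = to; // extend the previous range
            }
            else {
                ranges.add(new long[] {from, to});
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<PageStats> pages = List.of(
                new PageStats(0, 100, 1, 10),
                new PageStats(100, 100, 11, 20),
                new PageStats(200, 100, 21, 30));
        for (long[] range : rowRanges(pages, 12, 25)) {
            System.out.println(range[0] + ".." + range[1]);
        }
    }
}
```

The reader then only needs to decode the pages that cover the returned ranges, which is where the page-skipping benefit comes from.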

raunaqmorarka · 2022-06-25T04:29:10Z

plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergParquetPageSkipping.java

I wonder if we should wait to have native writer support for page indexes before we implement this.
Otherwise users don't have a straightforward way to make use of this functionality.

Does the Hive connector write them? We could create / insert data into a table from Hive and then migrate it to an Iceberg table like we do here: https://github.com/trinodb/trino/blob/master/testing/trino-product-tests/src/main/java/io/trino/tests/product/iceberg/TestIcebergSparkCompatibility.java#L1305

Yes, the current default Parquet writer in the Hive connector writes them, although that would stop once we change the default to the native Parquet writer (unless we also implement it there).
While that is a better way of testing, it's still difficult for end users to benefit from this feature.

It is not ideal, but testing the feature is a bit of a chicken/egg problem unless we put both in at the same time.

If we add page index support to the native Parquet writer, we can test reads from it via the Hive and Delta connectors, so we don't need to do both at the same time.

I don't mind suspending this PR until the native Parquet writer supports page indexes.

raunaqmorarka · 2022-06-25T04:31:23Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSourceProvider.java

The version of predicateMatches without columnIndex in PredicateUtils should be removed now.

raunaqmorarka · 2022-06-25T04:33:02Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSourceProvider.java

Although we won't read the column index from the file until later, it is better to avoid creating columnIndex until it's needed (after the start <= firstDataPage && firstDataPage < start + length check). We can make the same change in ParquetPageSourceFactory as well.
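
A rough sketch of the suggested ordering, assuming a simplified row-group loop: the cheap split-range check runs first, and the column index holder is created only for row groups that pass it. `LazyColumnIndexSketch`, `RowGroup`, and `createColumnIndex` are hypothetical stand-ins for this example, not the actual Trino classes or methods.

```java
import java.util.ArrayList;
import java.util.List;

final class LazyColumnIndexSketch {
    // Hypothetical stand-in for row group metadata; not a Trino or Parquet class.
    static final class RowGroup {
        final String name;
        final long firstDataPage; // file offset of the row group's first data page

        RowGroup(String name, long firstDataPage) {
            this.name = name;
            this.firstDataPage = firstDataPage;
        }
    }

    // Counts how many times the (hypothetical) column index holder was created.
    static int columnIndexCreations;

    static String createColumnIndex(RowGroup rowGroup) {
        columnIndexCreations++;
        return "columnIndex(" + rowGroup.name + ")";
    }

    static List<String> columnIndexesForSplit(List<RowGroup> rowGroups, long start, long length) {
        List<String> result = new ArrayList<>();
        for (RowGroup rowGroup : rowGroups) {
            long firstDataPage = rowGroup.firstDataPage;
            // Cheap check first: does this row group belong to the split at all?
            if (start <= firstDataPage && firstDataPage < start + length) {
                // Only now create the column index holder for this row group.
                result.add(createColumnIndex(rowGroup));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<RowGroup> rowGroups = List.of(
                new RowGroup("rg0", 4),
                new RowGroup("rg1", 1000),
                new RowGroup("rg2", 2000));
        System.out.println(columnIndexesForSplit(rowGroups, 0, 1000));
        System.out.println("created: " + columnIndexCreations);
    }
}
```

With this ordering, row groups outside the split never pay for the holder, even though the index itself would only be read from the file later.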

raunaqmorarka · 2022-06-25T04:36:15Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java

Could you add a bit of rationale to this commit about how it's required for the page indexing feature?

I am curious why we want this and how it relates.

The native Parquet writer doesn't support writing the column index, so we need to use the legacy writer. However, the legacy writer doesn't write field-id, so the connector can't read the fields correctly.

~~I think it would be better to suspend this PR until native Parquet writer improvement.~~
We could revert this change and modify the table property using Iceberg library within the test.

findepi · 2022-06-28T11:57:25Z

lib/trino-parquet/src/main/java/io/trino/parquet/writer/ParquetWriterOptions.java

do we want to allow setting row group size to >2GB?

findepi · 2022-06-28T11:58:18Z

plugin/trino-hive/src/test/java/io/trino/plugin/hive/BaseTestParquetPageSkipping.java

Rename TestParquetPageSkipping to TestHiveTestParquetPageSkipping

this is not a "rename class" change.
it's more extraction of a test base class.

findepi · 2022-06-28T11:59:08Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java

I am curious why we want this and how it relates.

The getter is already long type.

Additionally, extract a method to return table definition to allow running in Iceberg connector.

ebyhr · 2022-08-16T00:31:11Z

Let's continue in #13584

cla-bot bot added the cla-signed label Jun 24, 2022

findepi requested review from alexjo2144, raunaqmorarka and skrzypo987 June 24, 2022 12:21

github-actions bot added the tests:hive label Jun 24, 2022

ebyhr force-pushed the ebi/iceberg-parquet-column-index branch from a9e5876 to 666cc52 Compare June 25, 2022 03:34

raunaqmorarka reviewed Jun 25, 2022

findepi reviewed Jun 28, 2022

ebyhr mentioned this pull request Aug 10, 2022

Add Parquet column index filtering to Iceberg #13584

Closed

ebyhr added 3 commits August 10, 2022 10:17

Change maxRowGroupSize field to long in ParquetWriterOptions

5edd4e3

The getter is already long type.

Rename TestParquetPageSkipping to TestHiveTestParquetPageSkipping

b05bab0

Additionally, extract a method to return table definition to allow running in Iceberg connector.

Use Parquet column index when reading table in Iceberg

3a5a15a

ebyhr force-pushed the ebi/iceberg-parquet-column-index branch from 666cc52 to 3a5a15a Compare August 10, 2022 03:04

ebyhr closed this Aug 16, 2022

ebyhr deleted the ebi/iceberg-parquet-column-index branch August 16, 2022 00:31
