Use Parquet column index when reading table in Iceberg #12977
ebyhr wants to merge 3 commits into trinodb:master
Conversation
a9e5876 to 666cc52
raunaqmorarka left a comment
Are we impacted by any of the problems mentioned in apache/iceberg#193 ?
Note that we rely on the parquet filter APIs in our parquet reader to get the row ranges to be read from the column index (ColumnIndexFilter.calculateRowRanges).
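For readers unfamiliar with the mechanism: the column index stores per-page min/max statistics, which lets the reader compute the row ranges that can possibly match a predicate and skip the rest. A minimal, self-contained sketch of that idea follows; it is an illustration only, not the parquet-mr implementation of ColumnIndexFilter.calculateRowRanges, and all names (Page, RowRange, matchingRowRanges) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class PageSkippingSketch {
    // Per-page metadata: first row index, row count, and min/max statistics
    record Page(long firstRowIndex, long rowCount, long min, long max) {}

    // An inclusive range of row indexes that must be read
    record RowRange(long from, long to) {}

    // For an equality predicate "value = x", keep only pages whose
    // [min, max] statistics could contain x; all other pages are skipped
    static List<RowRange> matchingRowRanges(List<Page> pages, long value) {
        List<RowRange> ranges = new ArrayList<>();
        for (Page page : pages) {
            if (value >= page.min() && value <= page.max()) {
                ranges.add(new RowRange(
                        page.firstRowIndex(),
                        page.firstRowIndex() + page.rowCount() - 1));
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<Page> pages = List.of(
                new Page(0, 100, 1, 10),
                new Page(100, 100, 11, 20),
                new Page(200, 100, 21, 30));
        // Only the second page can contain the value 15,
        // so only rows 100-199 need to be read
        System.out.println(matchingRowRanges(pages, 15));
    }
}
```

The real API additionally handles null counts, multi-column predicates, and merging overlapping ranges across columns, but the skipping principle is the same.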
I wonder if we should wait for native writer support for page indexes before implementing this.
Otherwise users don't have a straightforward way to make use of this functionality.
Does the Hive connector write them? We could create / insert data into a table from Hive and then migrate it to an Iceberg table like we do here: https://github.com/trinodb/trino/blob/master/testing/trino-product-tests/src/main/java/io/trino/tests/product/iceberg/TestIcebergSparkCompatibility.java#L1305
Yes, the current default parquet writer in the hive connector writes them, although that would stop once we change the default to the native parquet writer (unless we also implement it there).
While that is a better way of testing, it's still difficult for end users to benefit from this feature.
It is not ideal, but testing the feature is a bit of a chicken/egg problem unless we put both in at the same time.
If we add page indexes support to native parquet writer, we can test reads from it via hive and delta connector, so we don't need to do both at the same time.
I don't mind suspending this PR until the native Parquet writer supports page indexes.
The version of predicateMatches without columnIndex in PredicateUtils should be removed now
Although we won't read the column index from the file until later, it is better to avoid creating columnIndex until it's needed (after the start <= firstDataPage && firstDataPage < start + length check). We can make the same change in ParquetPageSourceFactory as well.
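The suggested ordering can be sketched as follows. This is a toy illustration of deferring the expensive column-index construction behind the cheap offset check; the names (buildColumnIndex, predicateMatches) and signature are hypothetical, not Trino's actual code.

```java
public class LazyColumnIndexSketch {
    // Tracks how often the expensive step actually runs
    static int buildCount = 0;

    // Stand-in for the (potentially expensive) column index creation
    static String buildColumnIndex() {
        buildCount++;
        return "columnIndex";
    }

    static boolean predicateMatches(long firstDataPage, long start, long length) {
        // Cheap check first: is this row group inside the split at all?
        if (!(start <= firstDataPage && firstDataPage < start + length)) {
            // Column index is never created for skipped row groups
            return false;
        }
        // Created only once the row group is known to be relevant
        String columnIndex = buildColumnIndex();
        return columnIndex != null;
    }

    public static void main(String[] args) {
        predicateMatches(50, 100, 100);  // row group outside split: skipped
        predicateMatches(150, 100, 100); // inside split: index built
        System.out.println(buildCount);  // prints 1
    }
}
```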
Could you add a bit of rationale to this commit about how it's required for page indexing feature ?
I am curious why we want this, and how it relates.
The native Parquet writer doesn't support writing the column index, so we need to use the legacy writer. However, the legacy writer doesn't write field-ids, so the connector can't read the fields correctly.
I think it would be better to suspend this PR until the native Parquet writer is improved.
We could revert this change and modify the table property using the Iceberg library within the test.
Do we want to allow setting the row group size to more than 2GB?
Rename TestParquetPageSkipping to TestHiveTestParquetPageSkipping
This is not a "rename class" change; it's more an extraction of a test base class.
I am curious why we want this, and how it relates.
The getter already has a long type.
Additionally, extract a method that returns the table definition to allow running it in the Iceberg connector.
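For context on why the long type matters for this setting: a row group size above 2GB does not fit in an int, so int arithmetic on such sizes silently overflows. A trivial illustration (not connector code):

```java
public class RowGroupSizeSketch {
    public static void main(String[] args) {
        // 3 GiB expressed with a long literal; the same arithmetic with
        // int literals (3 * 1024 * 1024 * 1024) would overflow, since
        // Integer.MAX_VALUE is just under 2 GiB
        long threeGiB = 3L * 1024 * 1024 * 1024;
        System.out.println(threeGiB > Integer.MAX_VALUE); // prints true
        System.out.println((int) threeGiB); // truncated to a negative int
    }
}
```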
666cc52 to 3a5a15a
Let's continue in #13584
Description
Use Parquet column index when reading table in Iceberg
Fixes #11000
Documentation
( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
( ) Release notes entries required with the following suggested text: