
Add Parquet column index filtering to Iceberg #13584

Closed
electrum wants to merge 3 commits into trinodb:master from electrum:iceberg-parquet

Conversation

@electrum
Member

@electrum electrum commented Aug 10, 2022

Description

Is this change a fix, improvement, new feature, refactoring, or other?

improvement

Related issues, pull requests, and links

Fixes #11000

Documentation

(x) No documentation is needed.

Release notes

(x) Release notes entries required with the following suggested text:

# Iceberg connector
* Improve performance of querying Parquet data for files containing column indexes. ({issue}`13584`)

@ebyhr
Member

ebyhr commented Aug 10, 2022

There's an existing PR #12977. I don't mind closing my PR, but we probably need to find a way to add tests in this PR.

@raunaqmorarka
Member

The native parquet writer doesn't produce page indexes, and that's the only writer in the iceberg connector, so we won't benefit from this unless another engine writes the page indexes.
This PR is missing tests with page indexes in the iceberg connector. Maybe we can produce a file with page indexes offline and use that for testing?
Are we impacted by any of the problems mentioned in apache/iceberg#193 ?
Note that we rely on parquet filter APIs in our parquet reader to get the row ranges to be read from column index (ColumnIndexFilter.calculateRowRanges).
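To illustrate what the column index gives the reader, here is a simplified, self-contained sketch — the `Page` record and `matchingRowRanges` helper are hypothetical stand-ins for what `ColumnIndexFilter.calculateRowRanges` derives from the real column and offset indexes:

```java
import java.util.ArrayList;
import java.util.List;

public class PageIndexPruning {
    // Hypothetical per-page stats: min/max value plus the page's row span.
    record Page(long min, long max, long firstRow, long rowCount) {}

    // Returns [start, end) row ranges for pages whose min/max could contain
    // the predicate value; all other pages can be skipped entirely.
    static List<long[]> matchingRowRanges(List<Page> pages, long value) {
        List<long[]> ranges = new ArrayList<>();
        for (Page p : pages) {
            if (p.min() <= value && value <= p.max()) {
                ranges.add(new long[] {p.firstRow(), p.firstRow() + p.rowCount()});
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<Page> pages = List.of(
                new Page(1, 100, 0, 1000),
                new Page(101, 200, 1000, 1000),
                new Page(201, 300, 2000, 1000));
        // Only the middle page can contain 150, so only rows [1000, 2000) are read.
        for (long[] r : matchingRowRanges(pages, 150)) {
            System.out.println(r[0] + "-" + r[1]); // prints "1000-2000"
        }
    }
}
```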

ImmutableList.Builder<Optional<ColumnIndexStore>> columnIndexes = ImmutableList.builder();
for (BlockMetaData block : parquetMetadata.getBlocks()) {
long firstDataPage = block.getColumns().get(0).getFirstDataPageOffset();
Optional<ColumnIndexStore> columnIndex = getColumnIndexStore(dataSource, block, descriptorsByPath, parquetTupleDomain, options);
Member


Although we won't read the column index from the file until later, it would be better to avoid creating columnIndex until it's needed (after the start <= firstDataPage && firstDataPage < start + length check). We can make the same change in ParquetPageSourceFactory as well.
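A minimal sketch of the suggested deferral (hypothetical names; the lookup counter only exists here to demonstrate that the expensive call is skipped for blocks outside the split):

```java
import java.util.Optional;

public class LazyColumnIndex {
    // Counts calls to the stand-in for the expensive ColumnIndexStore lookup.
    static int lookups = 0;

    static Optional<String> getColumnIndexStore() {
        lookups++;
        return Optional.of("index");
    }

    public static void main(String[] args) {
        long start = 0;
        long length = 100;
        long[] firstDataPages = {50, 500}; // one block inside the split, one outside

        for (long firstDataPage : firstDataPages) {
            // Defer the lookup until we know the block overlaps this split.
            if (start <= firstDataPage && firstDataPage < start + length) {
                Optional<String> columnIndex = getColumnIndexStore();
                System.out.println("read block with index " + columnIndex.orElse("none"));
            }
        }
        System.out.println("lookups=" + lookups); // prints "lookups=1"
    }
}
```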

Member Author


This will change the indentation of the block and make the diff harder to read, and is unrelated to the Iceberg change, so let's do that as a follow up.

@ebyhr
Member

ebyhr commented Aug 10, 2022

I updated 3a5a15a that includes a generated Parquet file. Please feel free to pick up the commit.

@osscm
Contributor

osscm commented Sep 28, 2022

@electrum thanks for helping to add this feature!
Wondering if we are planning to add this in the next release. Thanks.

@osscm
Contributor

osscm commented Oct 13, 2022

I updated 3a5a15a that includes a generated Parquet file. Please feel free to pick up the commit.

thanks @ebyhr!

Looks like you have a test case with the generated file as well: https://github.com/trinodb/trino/blob/3a5a15a53ba5e639287d073f661945672f5f6bc3/plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergParquetPageSkipping.java

Are there any other tests that need to be added? I'd be happy to work on them.

@osscm
Contributor

osscm commented Oct 14, 2022

I'm trying to backport this to 391 and test it.
I tried to run the tests; a few of them were failing. There might be some incompatibility with the 391 Parquet reader/writer.

@raunaqmorarka
Member

I'm trying to backport this to 391 and test it. I tried to run the tests; a few of them were failing. There might be some incompatibility with the 391 Parquet reader/writer.

Those tests fail because the native Trino parquet writer used by the iceberg connector does not write page indexes to files yet.
The main blocker here is a resolution to apache/iceberg#193
In practice I have not found page indexes to improve performance much, because page-level min/max indexes are not very selective unless the data is sorted. If the data is sorted, then row group pruning already provides most of the benefit.
When multiple columns are read with a selective predicate on one of the columns, the lazy loading of blocks in the engine already allows the orc/parquet readers to skip decoding the filtered rows of the remaining columns after the selective predicate is applied.
Though page indexes may eliminate reads from S3 for a subset of the parquet pages in a column chunk, in practice reads from nearby positions in a file are often coalesced by the orc/parquet reader to avoid making multiple small reads from S3, which often limits the benefit to just saving decompression of the eliminated pages.
Reading page indexes also incurs additional S3 requests, because they are not part of the footer; the footer only has references to them. So on the whole, page indexes do not significantly improve performance, even though they sound like a valuable feature in theory.

@osscm
Contributor

osscm commented Oct 15, 2022

I'm trying to backport this to 391 and test it. […]

Those tests fail because the native Trino parquet writer used by the iceberg connector does not write page indexes to files yet. […]

Thanks, @raunaqmorarka for the detailed response!
In that case, since Bloom filters are supported by Iceberg+Spark for the Parquet file format, it would be worth supporting them in Trino reads as well. Bloom filters can provide better performance in some cases, especially for high-cardinality columns.

@raunaqmorarka
Member

Thanks, @raunaqmorarka for the detailed response! […]

There is an open PR about that #14428

@mosabua
Member

mosabua commented Jan 12, 2024

@electrum @raunaqmorarka @ebyhr @osscm .. is this still in progress or replaced by some other work?

@mwong77
Contributor

mwong77 commented May 20, 2024

After cherry-picking the commits in this PR, I want to bring up an issue that was discovered. I performed the following steps:

  1. Create an iceberg table partitioned by a single column and insert some initial data rows.
  2. Add a new partition column to the iceberg table (partition spec evolution) and insert some more data rows.
  3. Run a simple SELECT query using the new partition column in a filter predicate.

I observed that no rows were returned by the query run in step 3. Digging a bit more into the Trino code, I can see the following:

In the IcebergPageSourceProvider class, the filter predicate with the new partition column gets treated as an unenforced predicate, even though the iceberg table defines the column used in the filter as a partition column. Ideally, the partitioning column should be an enforced predicate (from my current understanding). The reason this partition column is not treated as an enforced predicate is that the canEnforceColumnConstraintInSpecs function only returns true and adds it to the enforced predicates list if all iceberg partitioning specs contain the new partitioning column from the filter predicate. In my table, I have two partitioning specs due to evolving the iceberg schema, and the partition column in my filter predicate was only added to the second iceberg partitioning spec. Trino then constructs the column index object based on the columns in the unenforced predicate (effectivePredicate was used to construct the parquetTupleDomain, which is used to construct the columnIndex).

After the column index object is constructed and Trino uses it to match the different blocks in the parquet file, Trino proceeds to configure the fields of the ParquetReader here. Looking closer, the partition column is not added to the parquetColumnFieldsBuilder, because the partition if-condition here is executed first and skips the branch here that would add the column to parquetColumnFieldsBuilder. As a result, when the ParquetReader class filters row ranges in the Parquet blocks here, the function returns no rows because of the following code in the ColumnIndexFilter class, which is called here:

// In ColumnIndexFilter class of hive parquet jar
public static RowRanges calculateRowRanges(FilterCompat.Filter filter, final ColumnIndexStore columnIndexStore, final Set<ColumnPath> paths, final long rowCount) {
    return (RowRanges)filter.accept(new FilterCompat.Visitor<RowRanges>() {
      public RowRanges visit(FilterCompat.FilterPredicateCompat filterPredicateCompat) {
        try {
          return (RowRanges) filterPredicateCompat.getFilterPredicate().accept(new ColumnIndexFilter(columnIndexStore, paths, rowCount));
        }
        catch (ColumnIndexStore.MissingOffsetIndexException e) {
          ColumnIndexFilter.LOGGER.info(e.getMessage());
          return RowRanges.createSingle(rowCount);
        }
      }

      public RowRanges visit(FilterCompat.UnboundRecordFilterCompat unboundRecordFilterCompat) {
        return RowRanges.createSingle(rowCount);
      }

      public RowRanges visit(FilterCompat.NoOpFilter noOpFilter) {
        return RowRanges.createSingle(rowCount);
      }
    });
  }

public <T extends Comparable<T>, U extends UserDefinedPredicate<T>> RowRanges visit(Operators.UserDefined<T, U> udp) {  
  return this.applyPredicate(udp.getColumn(), (ci) -> {  
    return (PrimitiveIterator.OfInt)ci.visit(udp);  
  }, udp.getUserDefinedPredicate().acceptsNullValue() ? this.allRows() : RowRanges.EMPTY);  
}


private RowRanges applyPredicate(Operators.Column<?> column, Function<ColumnIndex, PrimitiveIterator.OfInt> func, RowRanges rangesForMissingColumns) {  
  ColumnPath columnPath = column.getColumnPath();  
  if (!this.columns.contains(columnPath)) {  
    return rangesForMissingColumns;  
  } else {  
    OffsetIndex oi = this.columnIndexStore.getOffsetIndex(columnPath);  
    ColumnIndex ci = this.columnIndexStore.getColumnIndex(columnPath);  
    if (ci == null) {  
      LOGGER.info("No column index for column {} is available; Unable to filter on this column", columnPath);  
      return this.allRows();  
    } else {  
      return RowRanges.create(this.rowCount, (PrimitiveIterator.OfInt)func.apply(ci), oi);  
    }  
  }  
}

In the above code, udp contains the user-defined predicate (the filter predicate using the partition column), and this.columns contains the list of columns added to parquetColumnFieldsBuilder in the IcebergPageSourceProvider class. Since parquetColumnFieldsBuilder does not contain the partition column from the filter predicate, return rangesForMissingColumns; is executed, and since my user-defined predicate does not accept null values, RowRanges.EMPTY is returned. Thus, zero rows are returned when a new partition column is added to an iceberg table and a query is then run that uses the new partition column in a filter predicate.
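The failure mode described above can be modeled with a self-contained sketch. This is simplified stand-in logic, not the real parquet-mr classes: applyPredicate here only mimics the missing-column branch of ColumnIndexFilter.applyPredicate.

```java
import java.util.Set;

public class MissingColumnRowRanges {
    // Simplified model of ColumnIndexFilter.applyPredicate: when the predicate's
    // column was never registered in the reader's field list, the filter falls
    // back to rangesForMissingColumns, which is empty for predicates that do not
    // accept nulls — so every row is filtered out.
    static final long EMPTY = 0;

    static long applyPredicate(Set<String> readerColumns, String predicateColumn,
                               boolean acceptsNull, long rowCount) {
        if (!readerColumns.contains(predicateColumn)) {
            return acceptsNull ? rowCount : EMPTY;
        }
        return rowCount; // the real code would intersect with the column index here
    }

    public static void main(String[] args) {
        // The evolved partition column is missing from parquetColumnFields,
        // and the equality predicate on it does not accept nulls.
        long rows = applyPredicate(Set.of("id", "name"), "new_partition_col", false, 1000);
        System.out.println(rows); // prints "0": the query returns no rows
    }
}
```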

@github-actions

github-actions bot commented Sep 4, 2024

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Sep 4, 2024
@mosabua
Member

mosabua commented Sep 18, 2024

@electrum is this something you are still pursuing?

@mosabua
Member

mosabua commented Sep 18, 2024

@cwsteinbach @alexjo2144 @findinpath ... I checked with @electrum and it would be good if someone from the team could pick this up.

@github-actions github-actions bot removed the stale label Sep 19, 2024
@github-actions

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Oct 11, 2024
@electrum electrum closed this Oct 11, 2024
@electrum electrum deleted the iceberg-parquet branch October 11, 2024 18:25

Development

Successfully merging this pull request may close these issues.

Skip reading Parquet pages using Column Indexes for Iceberg

8 participants